Programming Languages

  • Article type: Journal Article
    Haxe is a general-purpose, object-oriented programming language supporting syntactic macros. The Haxe compiler is well known for its ability to translate the source code of Haxe programs into the source code of a variety of other programming languages, including Java, C++, JavaScript, and Python. Although Haxe is increasingly used for a variety of purposes, including games, it has not yet attracted much attention from bioinformaticians. This is surprising, as Haxe allows generating different versions of the same program (e.g. a graphical user interface version in JavaScript running in a web browser for beginners and a command-line version in C++ or Python for increased performance) while maintaining a single codebase, a feature that should be of interest for many bioinformatic applications. To demonstrate the usefulness of Haxe in bioinformatics, we present here the case story of the program SeqPHASE, written originally in Perl (with a CGI version running on a server) and published in 2010. As Perl+CGI is no longer desirable for security reasons, we decided to rewrite the SeqPHASE program in Haxe and to host it at GitHub Pages, thereby removing the need to configure and maintain a dedicated server. Using SeqPHASE as an example, we discuss the advantages and disadvantages of Haxe's source code conversion functionality when it comes to implementing bioinformatic software.






  • Article type: Journal Article
    One might assume that in silico experiments in systems biology are less susceptible to reproducibility issues than their wet-lab counterparts, because they are free from natural biological variations and their environment can be fully controlled. However, recent studies show that only half of the published mathematical models of biological systems can be reproduced without substantial effort. In this article we examine the potential causes for failed or cumbersome reproductions in a case study of a one-dimensional mathematical model of the atrioventricular node, which took us four months to reproduce. The model demonstrates that even otherwise rigorous studies can be hard to reproduce due to missing information, errors in equations and parameters, a lack of available data files, non-executable code, missing or incomplete experiment protocols, and missing rationales behind equations. Many of these issues seem similar to problems that have been solved in software engineering using techniques such as unit testing, regression tests, continuous integration, version control, archival services, and a thorough modular design with extensive documentation. Applying these techniques, we reimplement the examined model using the modeling language Modelica. The resulting workflow is independent of the model and can be translated to SBML, CellML, and other languages. It guarantees methods reproducibility by executing automated tests in a virtual machine on a server that is physically separated from the development environment. Additionally, it facilitates results reproducibility, because the model is more understandable and because the complete model code, experiment protocols, and simulation data are published and can be accessed in the exact version that was used in this article.
We found the additional design and documentation effort well justified, even just considering the immediate benefits during development such as easier and faster debugging, increased understandability of equations, and a reduced requirement for looking up details from the literature.






  • Article type: Journal Article
    Programming is one of the most crucial abilities for students in science and technology courses. Few studies on programming ability have considered the effect of students' construal levels on their learning performance. Therefore, the effects of students' construal levels were explored in this study to fill this research gap and open a new avenue for improving programming ability. The research participants were 110 seventh- and eighth-grade students with basic programming abilities taking an Arduino course. Data were collected from online questionnaires and analyzed using two-way analysis of variance and structural equation modeling to investigate the relationships among construal levels, programming ability, and learning satisfaction. The results revealed that students' construal levels affect their learning satisfaction and programming ability. These findings indicate that teaching strategies could effectively improve the learning satisfaction and programming ability of junior high school students.







  • Article type: Journal Article
    Reproducibility has been shown to be limited in many scientific fields. Reproducibility is a fundamental tenet of scientific activity, but the related issues of reusability of scientific data are poorly documented. Here, we present a case study of our difficulties in reproducing a published bioinformatics method even though code and data were available. First, we tried to re-run the analysis with the code and data provided by the authors. Second, we reimplemented the whole method in a Python package to avoid dependency on a MATLAB license and ease the execution of the code on a high-performance computing cluster. Third, we assessed the reusability of our reimplementation and the quality of our documentation, testing how easy it would be to start from our implementation to reproduce the results. In a second section, we propose solutions from this case study and other observations to improve reproducibility and research efficiency at the individual and collective levels. While finalizing our code, we created case-specific documentation and tutorials for the associated Python package StratiPy. Readers are invited to experiment with our reproducibility case study by generating the two confusion matrices (see more in section "Robustness: from MATLAB to Python, language and organization"). Here, we propose two options: a step-by-step process to follow in a Jupyter/IPython notebook or a Docker container ready to be built and run.







  • Article type: Journal Article
    Identifying rare but significant healthcare events in massive unstructured datasets has become a common task in healthcare data analytics. However, imbalanced class distribution in many practical datasets greatly hampers the detection of rare events, as most classification methods implicitly assume an equal occurrence of classes and are designed to maximize the overall classification accuracy. In this study, we develop a framework for learning from healthcare data with imbalanced distribution by incorporating different rebalancing strategies. The evaluation results showed that the developed framework can significantly improve the detection accuracy of medical incidents due to look-alike sound-alike (LASA) mix-ups. Specifically, logistic regression combined with the synthetic minority oversampling technique (SMOTE) produces the best detection results, with a significant 45.3% relative increase in recall (recall = 75.7%) compared with pure logistic regression (recall = 52.1%).
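
    The rebalancing idea can be illustrated with a minimal sketch. It is not the framework from the study: the function name `smote_like_oversample`, the toy data, and the parameter values are assumptions made for illustration, and the code only mimics the core SMOTE step of interpolating between a minority sample and one of its nearest minority neighbours. The last line reproduces the relative recall gain from the two recall values quoted above.

```python
import math
import random


def smote_like_oversample(minority, n_new, k=2, seed=0):
    """Create n_new synthetic points by interpolating between a minority
    sample and one of its k nearest minority neighbours (SMOTE-style)."""
    if len(minority) < 2:
        raise ValueError("need at least two minority samples")
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # Rank the other minority samples by Euclidean distance to the base.
        others = [p for p in minority if p is not base]
        others.sort(key=lambda p: math.dist(base, p))
        neighbour = rng.choice(others[:k])
        # Pick a random point on the segment between base and neighbour.
        gap = rng.random()
        synthetic.append(
            tuple(b + gap * (n - b) for b, n in zip(base, neighbour))
        )
    return synthetic


# Toy data (hypothetical): 3 rare-event samples versus 9 common ones.
minority = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.3)]
majority_count = 9
new_points = smote_like_oversample(minority, majority_count - len(minority))
balanced_minority = minority + new_points

# Relative recall gain from the quoted values: (75.7 - 52.1) / 52.1.
relative_gain = (75.7 - 52.1) / 52.1
```

    After oversampling, the minority class has as many samples as the majority class, so a classifier trained on the combined data no longer sees the rare event as negligible. In practice a maintained implementation would normally be preferred over a hand-written one.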







  • Article type: Journal Article
    Reusing the data from healthcare information systems can effectively facilitate clinical trials (CTs). Selecting candidate patients who meet CT recruitment criteria is a central task. Related work either depends on a database administrator (DBA) to convert the recruitment criteria to native SQL queries or involves data mapping between a standard ontology/information model and each individual data source schema. This paper proposes an alternative computer-aided CT recruitment paradigm, based on syntax translation between different domain-specific languages (DSLs). In this paradigm, the CT recruitment criteria are first formally represented as production rules. The referenced rule variables are all taken from the underlying database schema. Then the production rule is translated to an intermediate query-oriented DSL (e.g., LINQ). Finally, the intermediate DSL is directly mapped to native database queries (e.g., SQL), a step automated by object-relational mapping (ORM).
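
    The translation idea can be sketched in a few lines. This is a simplified illustration, not the paradigm's implementation: it skips the intermediate DSL and the ORM layer and maps a rule directly to a parameterised SQL query. The function `rule_to_sql`, the `patients` table, its columns, and the sample rule are all hypothetical.

```python
import sqlite3

# Map rule operators to SQL operators; anything else is rejected.
SQL_OPERATORS = {"==": "=", "!=": "<>", ">": ">", ">=": ">=", "<": "<", "<=": "<="}


def rule_to_sql(table, id_column, columns, conditions):
    """Translate a conjunction of (column, operator, value) conditions
    into a parameterised SELECT statement and its parameter list."""
    clauses, params = [], []
    for column, op, value in conditions:
        # Rule variables must come from the known database schema.
        if column not in columns:
            raise ValueError(f"unknown column: {column}")
        if op not in SQL_OPERATORS:
            raise ValueError(f"unsupported operator: {op}")
        clauses.append(f"{column} {SQL_OPERATORS[op]} ?")
        params.append(value)
    where = " AND ".join(clauses) if clauses else "1 = 1"
    return f"SELECT {id_column} FROM {table} WHERE {where}", params


# Hypothetical schema and data, held in an in-memory database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE patients (patient_id INTEGER, age INTEGER, diagnosis TEXT)")
conn.executemany(
    "INSERT INTO patients VALUES (?, ?, ?)",
    [(1, 54, "T2DM"), (2, 17, "T2DM"), (3, 61, "asthma")],
)

# Hypothetical recruitment rule: adults diagnosed with T2DM.
rule = [("age", ">=", 18), ("diagnosis", "==", "T2DM")]
sql, params = rule_to_sql("patients", "patient_id", {"patient_id", "age", "diagnosis"}, rule)
eligible = [row[0] for row in conn.execute(sql + " ORDER BY patient_id", params)]
```

    Values are passed as query parameters rather than spliced into the SQL text, and column names are checked against the schema, which keeps the generated query safe even when rule values come from user input.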






  • Article type: Journal Article
    In recent years, RNA-seq has become an important method for measuring gene expression in various cells and organisms. This chapter will detail all the bioinformatic steps that should be undertaken to determine differentially expressed genes from a typical RNA-seq experiment. Each step will be clearly explained in "non-bioinformatic" terminology so that readers embarking on RNA-seq analysis will be able to understand the rationale and reasoning behind each step. Moreover, the exact command lines used to process the data will be presented along with a description of the various flags and commands.






  • Article type: Journal Article
    Geographic profiling is a technique used to find the origin of a series of crimes. The method was recently extended to other fields. One of the best-known data sets in epidemiology is the one collected by John Snow during an outbreak of cholera in London. We wrote Python scripts that apply geographic profiling to Snow's historical data set in order to locate the origin of the infection. We modified the method by applying a weight to each point of the map where cases of cholera were reported. The weight was proportional to the number of cases in a given location. This modification of the geographic profiling method made it possible to identify on the map an area of maximum probability for the infection source, which was a few meters wide and included the historically known source of cholera, namely the "classical" water pump at Broad Street. The method appears to be a useful complement for identifying the source of epidemics when available data about the cases of infection can be summarized on a map.
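
    The weighting step can be illustrated with a minimal sketch. It does not reproduce the scripts or the exact formula used in the study: it assumes a plain distance-decay score in which each case site contributes its case count divided by a power of the distance to the grid cell. The function `weighted_profile`, the `decay` and `eps` parameters, and the toy coordinates are all assumptions.

```python
import math


def weighted_profile(cases, grid_size, decay=1.5, eps=0.5):
    """Score every grid cell by a weighted distance-decay sum over case sites.

    cases: list of (x, y, n_cases); the case count acts as the weight.
    Returns a dict mapping (x, y) grid cells to scores.
    """
    scores = {}
    for gx in range(grid_size):
        for gy in range(grid_size):
            total = 0.0
            for x, y, n_cases in cases:
                distance = math.hypot(gx - x, gy - y)
                # eps avoids division by zero when a cell holds a case site.
                total += n_cases / (distance + eps) ** decay
            scores[(gx, gy)] = total
    return scores


# Toy map (hypothetical): three locations with 10, 4, and 1 reported cases.
cases = [(2, 2, 10), (3, 2, 4), (7, 7, 1)]
scores = weighted_profile(cases, grid_size=10)
peak = max(scores, key=scores.get)
```

    With these toy values the highest score falls on the cell with the most cases, which shows how weighting by case count pulls the area of maximum probability toward heavily affected locations.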






  • Article type: Journal Article
    This study examined ways to improve the accuracy of translating clinical practice guidelines (CPGs) into a computer-interpretable guideline (CIG) for pressure-ulcer management using the Shareable Active Guideline Environment (SAGE) guideline model, and aimed to verify the accuracy of the obtained CIG. The study was conducted using the following procedures: selecting CPGs, extracting rules from the selected CPGs, developing a CIG using the SAGE guideline model, and verifying the obtained CIG with test cases using an execution engine. The CIG for pressure-ulcer management was developed based on 38 rules and three algorithms at the semiformal representation level using MS Excel and MS Visio. The CIG was encoded by two Activity Graphs consisting of 115 instances representing algorithms and rules as knowledge elements in the SAGE guideline model. Two errors were found and corrected. Results of the study demonstrated that a CIG representing knowledge on pressure-ulcer management can be effectively developed using commonly available programs and the SAGE guideline model, and that the obtained CIG can be verified with a locally developed execution engine. The CIG developed in the study could contribute to health information management once it is implemented successfully in a clinical decision support system.






  • Article type: Journal Article
    RegulonDB is a database storing the biological information behind the transcriptional regulatory network (TRN) of the bacterium Escherichia coli. It is one of the key bioinformatics resources for Systems Biology investigations of bacterial gene regulation. As in most biological databases, the content drifts with time, both due to the accumulation of new information and due to refinements in the underlying biological concepts. Conclusions based on previous database versions may no longer hold. Here, we study the change of some topological properties of the TRN of E. coli, as provided by RegulonDB across 16 versions, as well as a simple index, digital control strength, quantifying the match between gene expression profiles and the transcriptional regulatory networks. While many of the network characteristics change dramatically across the different versions, the digital control strength remains rather robust and in tune with previous results for this index. Our study shows that: (i) results derived from network topology should, when possible, be studied across a range of database versions, before detailed biological conclusions are derived, and (ii) resorting to simple indices, when interpreting high-throughput data from a network perspective, may help achieve robustness of the findings against variation of the underlying biological information. Database URL:
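
    Comparing topological properties across database versions can be sketched as follows. The example is illustrative only: the function `topology_summary`, the chosen properties, and the two toy edge lists are assumptions, and the gene and regulator names are placeholders rather than RegulonDB content.

```python
from collections import Counter


def topology_summary(edges):
    """Return simple topological properties of a directed regulatory network
    given as (regulator, target) pairs."""
    edges = set(edges)
    if not edges:
        raise ValueError("network has no edges")
    nodes = {node for edge in edges for node in edge}
    # Out-degree counts how many targets each regulator controls.
    out_degree = Counter(regulator for regulator, _ in edges)
    hub, hub_targets = out_degree.most_common(1)[0]
    return {
        "nodes": len(nodes),
        "edges": len(edges),
        "mean_out_degree": len(edges) / len(nodes),
        "top_regulator": hub,
        "top_regulator_targets": hub_targets,
    }


# Two hypothetical versions of the same network (placeholder names).
version_1 = [("tfA", "g1"), ("tfA", "g2"), ("tfB", "g3")]
version_2 = version_1 + [("tfA", "g3"), ("tfC", "g1"), ("tfC", "tfB")]

summaries = {
    "v1": topology_summary(version_1),
    "v2": topology_summary(version_2),
}
```

    Running the same summary over every available version makes it easy to see which properties drift and which stay stable before any biological conclusion is drawn from a single snapshot.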






