Programming Languages

  • Article type: Journal Article
    Language models are playing an increasingly important role in many areas of artificial intelligence (AI) and computational biology. In this primer, we discuss the ways in which language models, both those based on natural language and those based on biological sequences, can be applied to biological research. This primer is primarily intended for biologists interested in using these cutting-edge AI technologies in their applications. We provide guidance on best practices and key resources for adapting language models for biology.

  • Article type: Journal Article
    BACKGROUND: Chemical reaction networks (CRNs) play a pivotal role in diverse fields such as systems biology, biochemistry, chemical engineering, and epidemiology. High-level definitions of CRNs enable the use of various simulation approaches, including deterministic and stochastic methods, from the same model. However, existing Python tools for simulating CRNs typically wrap external C/C++ libraries for model definition, translation into equations, and/or numerical solution, limiting their extensibility and integration with the broader Python ecosystem.
    RESULTS: In response, we developed Poincaré and SimBio, two novel Python packages for simulation of dynamical systems and CRNs. Poincaré serves as a foundation for dynamical systems modeling, while SimBio extends this functionality to CRNs, including support for the Systems Biology Markup Language (SBML). Poincaré and SimBio are developed as pure Python packages, enabling users to easily extend their simulation capabilities by writing new Python packages or leveraging existing ones. Moreover, this does not compromise performance, as the code can be just-in-time (JIT) compiled with Numba. Our benchmark tests using curated models from the BioModels repository indicate that these tools may offer a performance advantage over other existing tools. In addition, to ensure a user-friendly experience, our packages use standard, typed, modern Python syntax that integrates seamlessly with integrated development environments (IDEs). Our Python-centric approach significantly enhances code analysis, error detection, and refactoring capabilities, positioning Poincaré and SimBio as valuable tools for the modeling community.
    METHODS: Poincaré and SimBio are released under the MIT license. Their source code is available on GitHub (https://github.com/maurosilber/poincare and https://github.com/hgrecco/simbio) and can be installed from PyPI or conda-forge.
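
The idea that one high-level CRN definition can drive both deterministic and stochastic simulation can be illustrated with a small standard-library sketch. This is not the Poincaré or SimBio API: the reaction list, the rate constants, and the function names below are all hypothetical, and the integrator is a plain forward-Euler scheme chosen for brevity.

```python
import random

# Hypothetical CRN: reversible binding A + B <-> C.
# Each reaction is (rate constant, reactants, products); the dicts map
# species names to stoichiometric coefficients (all equal to 1 here).
REACTIONS = [
    (0.01, {"A": 1, "B": 1}, {"C": 1}),  # binding
    (0.1, {"C": 1}, {"A": 1, "B": 1}),   # unbinding
]


def propensity(rate, reactants, state):
    """Mass-action rate: k times the product of the reactant amounts.

    Valid as a stochastic propensity only for coefficients of 1,
    which is all this sketch uses.
    """
    value = rate
    for species, count in reactants.items():
        value *= state[species] ** count
    return value


def simulate_ode(state, t_end, dt=0.001):
    """Deterministic simulation by forward-Euler integration."""
    state = dict(state)
    for _ in range(int(round(t_end / dt))):
        rates = [propensity(k, r, state) for k, r, _ in REACTIONS]
        for rate, (_, reactants, products) in zip(rates, REACTIONS):
            for species, n in reactants.items():
                state[species] -= n * rate * dt
            for species, n in products.items():
                state[species] += n * rate * dt
    return state


def simulate_gillespie(state, t_end, seed=0):
    """Stochastic simulation with Gillespie's direct method."""
    rng = random.Random(seed)
    state = dict(state)
    t = 0.0
    while True:
        rates = [propensity(k, r, state) for k, r, _ in REACTIONS]
        total = sum(rates)
        if total == 0:
            break  # no reaction can fire any more
        t += rng.expovariate(total)  # waiting time to the next event
        if t > t_end:
            break
        _, reactants, products = rng.choices(REACTIONS, weights=rates)[0]
        for species, n in reactants.items():
            state[species] -= n
        for species, n in products.items():
            state[species] += n
    return state


initial = {"A": 100, "B": 80, "C": 0}
ode_result = simulate_ode(initial, t_end=5.0)
ssa_result = simulate_gillespie(initial, t_end=5.0)
```

Both simulators read the same `REACTIONS` list, so changing the model in one place changes both simulations. In either case the quantity A + C is conserved, which makes a convenient sanity check.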

  • Article type: Journal Article
    Researchers often need to synthesize genes of interest in this era of synthetic biology. Gene synthesis by PCR assembly of multiple DNA fragments is a quick, economical, and widely applied method. Several software solutions already exist for designing the fragments used in gene synthesis. However, some of them are written in programming languages that are no longer popular, while others are commercial or require users to access a server. In this study, we propose a Python program to design DNA fragments for gene synthesis. The algorithm is designed to meet experimental needs, and the source code, with detailed annotation, is freely available to all users. Furthermore, the feasibility of the algorithm and the program is validated by experiments. Our program can be useful for designing gene synthesis in the laboratory and can support the study of gene structure and function.
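
As a rough illustration of what fragment design for PCR assembly involves, the sketch below tiles a gene into overlapping fragments on alternating strands. It is not the algorithm from the paper: the fixed fragment length, the fixed overlap, and the function names are assumptions made for this example, and a real design tool would also need to consider properties such as melting temperature.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def design_fragments(gene, length=40, overlap=20):
    """Tile `gene` into overlapping fragments on alternating strands.

    Even-indexed fragments are taken from the sense strand and
    odd-indexed fragments from the antisense strand, so that each
    neighbouring pair can anneal across the shared overlap.
    """
    if not 0 < overlap < length:
        raise ValueError("overlap must be positive and shorter than length")
    step = length - overlap
    fragments = []
    start = 0
    while True:
        piece = gene[start:start + length]
        if len(fragments) % 2 == 1:
            piece = reverse_complement(piece)
        fragments.append(piece)
        if start + length >= len(gene):
            break  # this fragment reaches the end of the gene
        start += step
    return fragments


# Hypothetical 100-nt sequence used only to exercise the function.
gene = "ATGGCTAGCTAGGATCCGAT" * 5
fragments = design_fragments(gene, length=40, overlap=20)
```

Converting the odd-indexed fragments back to the sense strand and joining the pieces while skipping each overlap reproduces the input gene, which is a quick way to check a design.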

  • Article type: Journal Article
    The increased demand for efficient computation in data analysis encourages researchers in biomedical science to use workflow systems. Workflow systems, or so-called workflow languages, are used to describe and execute a set of data analysis steps. Workflow systems increase the productivity of researchers, specifically in fields that use high-throughput DNA sequencing applications, where scalable computation is required. As these systems have improved the portability of data analysis workflows, research communities are able to share workflows to reduce the cost of building ordinary analysis procedures. However, having multiple workflow systems in a research field has spread effort across different workflow system communities. As each workflow system has its unique characteristics, it is not feasible to learn every single system in order to use publicly shared workflows. Thus, we developed Sapporo, an application that provides a unified layer of workflow execution over the differences among workflow systems. Sapporo has two components: an application programming interface (API) that receives workflow run requests and a browser-based client for the API. The API follows the Workflow Execution Service API standard proposed by the Global Alliance for Genomics and Health. The current implementation supports the execution of workflows in four languages: Common Workflow Language, Workflow Description Language, Snakemake, and Nextflow. With its extensible and scalable design, Sapporo can support the research community in utilizing valuable resources for data analysis.
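
To give a feel for what a workflow run request looks like, the sketch below assembles the form fields that the GA4GH Workflow Execution Service (WES) specification defines for submitting a run. The field names follow the WES 1.x specification as commonly documented; the workflow URL, the parameter values, and the helper name are hypothetical, and Sapporo's own documentation should be checked for the exact values it accepts.

```python
import json


def build_run_request(workflow_url, workflow_type, workflow_type_version, params):
    """Assemble the form fields for a WES run submission.

    `workflow_params` is sent as a JSON-encoded string, so the
    parameter dictionary is serialized here.
    """
    return {
        "workflow_url": workflow_url,
        "workflow_type": workflow_type,
        "workflow_type_version": workflow_type_version,
        "workflow_params": json.dumps(params),
    }


# Hypothetical workflow location and inputs, for illustration only.
form = build_run_request(
    workflow_url="https://example.org/workflows/trimming.cwl",
    workflow_type="CWL",
    workflow_type_version="v1.0",
    params={"fastq": {"class": "File", "path": "reads.fastq"}},
)
```

In the WES specification these fields are submitted as multipart form data in a POST request to the `/runs` endpoint of the service, and the response carries a run identifier that can be used to query the run's status.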

  • Article type: Journal Article
    This manuscript describes the development of a resource module that is part of a learning platform named 'NIGMS Sandbox for Cloud-based Learning', https://github.com/NIGMS/NIGMS-Sandbox. The overall genesis of the Sandbox is described in the editorial authored by the National Institute of General Medical Sciences, 'NIGMS Sandbox: A Learning Platform toward Democratizing Cloud Computing for Biomedical Research', at the beginning of this supplement. This module delivers learning materials introducing the utility of the BASH (Bourne Again Shell) programming language for genomic data analysis in an interactive format that uses appropriate cloud resources for data access and analyses. The next-generation sequencing revolution has generated massive amounts of novel biological data from a multitude of platforms that survey an ever-growing list of genomic modalities. These data require significant downstream computational and statistical analyses to glean meaningful biological insights. However, the skill sets required to generate these data are vastly different from the skills required to analyze them. Bench scientists who generate next-generation data often lack the training required to analyze these datasets and require support from bioinformatics specialists. Dedicated computational training is required to empower biologists in the area of genomic data analysis; however, learning to use a command line interface efficiently is a significant barrier to learning how to use common analytical tools. Cloud platforms have the potential to democratize access to the technical tools and computational resources necessary to work with modern sequencing data, providing an effective framework for bioinformatics education. This module aims to provide an interactive platform that gradually builds the technical skills and knowledge needed to interact with genomics data on the command line in the cloud. The sandbox format of this module enables users to move through the material at their own pace and test their grasp of the material with knowledge self-checks before building on that material in the next sub-module.
    This manuscript describes the development of a resource module that is part of a learning platform named 'NIGMS Sandbox for Cloud-based Learning', https://github.com/NIGMS/NIGMS-Sandbox. The overall genesis of the Sandbox is described in the editorial NIGMS Sandbox [1] at the beginning of this Supplement. This module delivers learning materials on the analysis of bulk and single-cell ATAC-seq data in an interactive format that uses appropriate cloud resources for data access and analyses.

  • Article type: Journal Article
    Bioinformatics tools are essential for performing analyses in the omics sciences. Given the numerous experimental opportunities arising from advances in the field of omics and easier access to high-throughput sequencing platforms, these tools play a fundamental role in research projects. Despite the considerable progress made possible by the development of bioinformatics tools, some tools are tailored to specific analytical goals, leading to challenges for non-bioinformaticians who need to integrate the results of these specific tools into a customized pipeline. To solve this problem, we have developed BioPipeline Creator, a user-friendly Java-based GUI that allows different software tools to be integrated into its repertoire while ensuring easy user interaction via an accessible graphical interface. Consisting of client and server software components, BioPipeline Creator provides an intuitive graphical interface that simplifies the use of various bioinformatics tools for users without advanced computer skills. It can run on less sophisticated devices or workstations, allowing users to keep their operating system without having to switch to another compatible system. The server is responsible for the processing tasks and can perform the analysis in the user's local or remote network structure. BioPipeline Creator is compatible with the major operating systems and is available at https://github.com/allanverasce/bpc.git.

  • Article type: Journal Article
    CONCLUSIONS: Effective collaboration between developers of Bayesian inference methods and users is key to advancing our quantitative understanding of biosystems. We here present hopsy, a versatile open-source platform designed to provide convenient access to powerful Markov chain Monte Carlo sampling algorithms tailored to models defined on convex polytopes (CP). Based on the high-performance C++ sampling library HOPS, hopsy inherits its strengths and extends its functionalities with the accessibility of the Python programming language. A versatile plugin mechanism enables seamless integration with domain-specific models, providing method developers with a framework for testing, benchmarking, and distributing CP samplers to approach real-world inference tasks. We showcase hopsy by solving common and newly composed domain-specific sampling problems, highlighting important design choices. By likening hopsy to a marketplace, we emphasize its role in bringing together users and developers, where users get access to state-of-the-art methods, and developers contribute their own innovative solutions for challenging domain-specific inference problems.
    METHODS: Sources, documentation and a continuously updated list of sampling algorithms are available at https://jugit.fz-juelich.de/IBG-1/ModSim/hopsy, with Linux, Windows and MacOS binaries at https://pypi.org/project/hopsy/.
    BACKGROUND: Supplementary data are available at Bioinformatics online.
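
Sampling on a convex polytope can be illustrated with hit-and-run, a classic Markov chain Monte Carlo scheme for the region {x : Ax ≤ b}. The sketch below is a standard-library illustration of that general technique only. It does not use the hopsy or HOPS API, and the abstract does not say which algorithms those libraries implement; the function name and the unit-square example are assumptions.

```python
import math
import random


def hit_and_run(A, b, x0, n_samples, seed=0):
    """Draw approximately uniform samples from {x : A x <= b}.

    `x0` must lie strictly inside the polytope, and the polytope
    must be bounded so that every chord has finite length.
    """
    rng = random.Random(seed)
    x = list(x0)
    samples = []
    for _ in range(n_samples):
        # Pick a direction uniformly on the unit sphere.
        d = [rng.gauss(0.0, 1.0) for _ in x]
        norm = math.sqrt(sum(v * v for v in d))
        d = [v / norm for v in d]
        # Find the interval [lo, hi] of step sizes t that keep
        # x + t * d inside every half-space a . x <= bound.
        lo, hi = -math.inf, math.inf
        for row, bound in zip(A, b):
            a_dot_d = sum(a * v for a, v in zip(row, d))
            slack = bound - sum(a * v for a, v in zip(row, x))
            if a_dot_d > 1e-12:
                hi = min(hi, slack / a_dot_d)
            elif a_dot_d < -1e-12:
                lo = max(lo, slack / a_dot_d)
        # Move to a uniformly chosen point on the chord.
        t = rng.uniform(lo, hi)
        x = [xi + t * di for xi, di in zip(x, d)]
        samples.append(x)
    return samples


# Unit square 0 <= x, y <= 1 written as A x <= b.
A = [[1, 0], [-1, 0], [0, 1], [0, -1]]
b = [1, 0, 1, 0]
samples = hit_and_run(A, b, x0=[0.5, 0.5], n_samples=200)
```

Every sample stays inside the polytope by construction, because each step is restricted to the chord through the current point.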

  • Article type: Journal Article
    CONCLUSIONS: Pretrained large language models (LLMs) have significantly improved code generation. As these models scale up, there is an increasing need for the output to handle more intricate tasks and to be appropriately specialized to particular domains. Here, we target bioinformatics due to the amount of domain knowledge, algorithms, and data operations this discipline requires. We present BioCoder, a benchmark developed to evaluate LLMs in generating bioinformatics-specific code. BioCoder spans much of the field, covering cross-file dependencies, class declarations, and global variables. It incorporates 1026 Python functions and 1243 Java methods extracted from GitHub, along with 253 examples from the Rosalind Project, all pertaining to bioinformatics. Using topic modeling, we show that the overall coverage of the included code is representative of the full spectrum of bioinformatics calculations. BioCoder incorporates a fuzz-testing framework for evaluation. We have applied it to evaluate various models including InCoder, CodeGen, CodeGen2, SantaCoder, StarCoder, StarCoder+, InstructCodeT5+, GPT-3.5, and GPT-4. Furthermore, we fine-tuned one model (StarCoder), demonstrating that our training dataset can enhance the performance on our testing benchmark (by >15% in terms of Pass@K under certain prompt configurations and always >3%). The results highlight two key aspects of successful models: (i) Successful models accommodate a long prompt (>2600 tokens) with full context, including functional dependencies. (ii) They contain domain-specific knowledge of bioinformatics, beyond just general coding capability. This is evident from the performance gain of GPT-3.5/4 compared to the smaller models on our benchmark (50% versus up to 25%).
    METHODS: All datasets, benchmark, Docker images, and scripts required for testing are available at: https://github.com/gersteinlab/biocoder and https://biocoder-benchmark.github.io/.
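
The abstract reports results in terms of Pass@K. A widely used way to estimate this metric is the unbiased estimator 1 - C(n-c, k) / C(n, k), where n completions are generated per problem and c of them pass the tests. Whether BioCoder computes Pass@K exactly this way is an assumption here; the sketch only shows the estimator itself.

```python
from math import comb


def pass_at_k(n, c, k):
    """Estimate the probability that at least one of k samples passes.

    n: number of completions generated for a problem
    c: number of those completions that pass the tests
    k: sample budget being evaluated (k <= n)
    """
    if n - c < k:
        # Fewer than k failures exist, so any k samples include a pass.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


# Hypothetical counts: 10 completions, 3 of which pass.
estimate = pass_at_k(n=10, c=3, k=1)
```

For k = 1 the estimator reduces to the plain pass rate c / n, which gives a simple check on an implementation. A benchmark score is then the mean of this value over all problems.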

  • Article type: Journal Article
    Mass spectrometry is a powerful technique for analyzing molecules in complex biological samples. However, inter- and intralaboratory variability and bias can affect the data due to various factors, including sample handling and preparation, instrument calibration and performance, and data acquisition and processing. To address this issue, the Quality Control (QC) working group of the Human Proteome Organization's Proteomics Standards Initiative has established the standard mzQC file format for reporting and exchanging information relating to data quality. mzQC is based on the JavaScript Object Notation (JSON) format and provides a lightweight yet versatile file format that can be easily implemented in software. Here, we present open-source software libraries to process mzQC data in three programming languages: Python, using pymzqc; R, using rmzqc; and Java, using jmzqc. The libraries follow a common data model and provide shared functionalities, including the (de)serialization and validation of mzQC files. We demonstrate use of the software libraries in a workflow for extracting, analyzing, and visualizing QC metrics from different sources. Additionally, we show how these libraries can be integrated with each other, with existing software tools, and in automated workflows for the QC of mass spectrometry data. All software libraries are available as open source under the MS-Quality-Hub organization on GitHub (https://github.com/MS-Quality-Hub).
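
Because mzQC is JSON-based, a file can be inspected with nothing more than a JSON parser. The sketch below uses only the Python standard library, not pymzqc, whose API is not described in this abstract. The document is a heavily simplified stand-in for an mzQC file: the key names reflect the general layout of the format as I understand it, while the accession and the metric values are placeholders, and a real file carries more required fields.

```python
import json

# Simplified, hypothetical mzQC-like document. The accession and the
# values are placeholders, not real controlled-vocabulary entries.
document = """
{
  "mzQC": {
    "version": "1.0.0",
    "runQualities": [
      {
        "metadata": {"label": "example run"},
        "qualityMetrics": [
          {"accession": "MS:0000000", "name": "number of MS1 spectra", "value": 1200},
          {"accession": "MS:0000000", "name": "number of MS2 spectra", "value": 8400}
        ]
      }
    ]
  }
}
"""


def collect_metrics(text):
    """Return {metric name: value} for every run in the document."""
    data = json.loads(text)
    metrics = {}
    for run in data["mzQC"]["runQualities"]:
        for metric in run["qualityMetrics"]:
            metrics[metric["name"]] = metric["value"]
    return metrics


metrics = collect_metrics(document)
```

A dedicated library adds what this sketch lacks, notably validation of the file against the format's schema and controlled vocabulary.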

  • Article type: Journal Article
    Interactive Jupyter Notebooks in combination with Conda environments can be used to generate FAIR (Findable, Accessible, Interoperable and Reusable/Reproducible) biomolecular simulation workflows. The interactive programming code accompanied by documentation and the possibility to inspect intermediate results with versatile graphical charts and data visualization is very helpful, especially in iterative processes, where parameters might be adjusted to a particular system of interest. This work presents a collection of FAIR notebooks covering various areas of the biomolecular simulation field, such as molecular dynamics (MD), protein-ligand docking, molecular checking/modeling, molecular interactions, and free energy perturbations. Workflows can be launched with myBinder or easily installed in a local system. The collection of notebooks aims to provide a compilation of demonstration workflows, and it is continuously updated and expanded with examples using new methodologies and tools.