MeSH: Genome, Plant; Plants, Edible / genetics; Genomics / methods; Deep Learning; Manihot / genetics

Source: DOI:10.1038/s42003-024-06465-2

Abstract:
Significant progress has been made in the field of plant genomics, as demonstrated by the increased use of high-throughput methodologies that enable the characterization of multiple genome-wide molecular phenotypes. These findings have provided valuable insights into plant traits and their underlying genetic mechanisms, particularly in model plant species. Nonetheless, effectively leveraging them to make accurate predictions represents a critical step in crop genomic improvement. We present AgroNT, a foundational large language model trained on genomes from 48 plant species with a predominant focus on crop species. We show that AgroNT obtains state-of-the-art predictions for regulatory annotations, promoter/terminator strength, and tissue-specific gene expression, and that it can prioritize functional variants. We conduct a large-scale in silico saturation mutagenesis analysis on cassava to evaluate the regulatory impact of over 10 million mutations and provide their predicted effects as a resource for variant characterization. Finally, we propose the use of the diverse datasets compiled here as the Plants Genomic Benchmark (PGB), providing a comprehensive benchmark for deep learning-based methods in plant genomic research. The pre-trained AgroNT model is publicly available on HuggingFace at https://huggingface.co/InstaDeepAI/agro-nucleotide-transformer-1b for future research purposes.
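
The in silico saturation mutagenesis mentioned in the abstract can be illustrated with a minimal sketch: enumerate every single-nucleotide substitution of a reference sequence and compare a score for each mutant against the reference. This is not the authors' pipeline. The names `saturation_mutants`, `score_effects`, and `gc_fraction` are hypothetical, and the GC-fraction scorer is only a toy stand-in for a model prediction such as one produced by AgroNT.

```python
from typing import Callable, Iterator, List, NamedTuple, Tuple

NUCLEOTIDES = "ACGT"


class Mutation(NamedTuple):
    position: int   # 0-based index into the reference sequence
    ref: str        # reference base at that position
    alt: str        # substituted base
    sequence: str   # full sequence carrying the substitution


def saturation_mutants(reference: str) -> Iterator[Mutation]:
    """Yield all single-nucleotide substitutions (3 per unambiguous base)."""
    reference = reference.upper()
    for pos, ref_base in enumerate(reference):
        if ref_base not in NUCLEOTIDES:
            continue  # skip ambiguous bases such as N
        for alt in NUCLEOTIDES:
            if alt != ref_base:
                mutated = reference[:pos] + alt + reference[pos + 1:]
                yield Mutation(pos, ref_base, alt, mutated)


def score_effects(
    reference: str, score_fn: Callable[[str], float]
) -> List[Tuple[int, str, str, float]]:
    """Return (position, ref, alt, mutant score minus reference score)."""
    baseline = score_fn(reference.upper())
    return [
        (m.position, m.ref, m.alt, score_fn(m.sequence) - baseline)
        for m in saturation_mutants(reference)
    ]


def gc_fraction(seq: str) -> float:
    """Toy scorer (assumption): fraction of G/C bases, standing in for a model."""
    return sum(base in "GC" for base in seq) / len(seq)


effects = score_effects("ACGT", gc_fraction)
for position, ref, alt, delta in effects:
    print(f"{position}\t{ref}>{alt}\t{delta:+.2f}")
```

In practice `score_fn` would wrap a trained predictor, and the number of mutants grows as three times the sequence length, which is how an analysis across many regulatory regions reaches millions of variants.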