Keywords: Fine-tuning; Genomic sequences; Genotype-phenotype; HERV; Motif

MeSH terms: Humans; Phenotype; Genomics / methods; Genome, Human; Models, Genetic; Endogenous Retroviruses / genetics; Deep Learning; Genotype

Source: DOI:10.1186/s12967-024-05567-z

Abstract:
BACKGROUND: Decoding human genomic sequences requires comprehensive analysis of DNA sequence functionality. Through computational and experimental approaches, researchers have studied the genotype-phenotype relationship and generated important datasets that help unravel complicated genetic blueprints. Recently developed artificial intelligence methods can therefore be used to interpret the functions of these DNA sequences.
METHODS: This study explores the use of deep learning, particularly pre-trained genomic models such as DNA_bert_6 and human_gpt2-v1, for interpreting and representing human genome sequences. First, we constructed multiple datasets linking genotypes and phenotypes and used them to fine-tune these models for DNA sequence classification. We then evaluated the influence of sequence length on classification results and analyzed the impact of feature extraction in the hidden layers of the model using the human endogenous retrovirus (HERV) dataset. To better understand the phenotype-specific patterns recognized by the model, we performed enrichment, pathogenicity, and conservation analyses of specific motifs in HERV sequences with high average local representation weight (ALRW) scores.
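As a rough illustration of the input preparation such a pipeline might involve, the sketch below splits a DNA sequence into overlapping 6-mers and truncates sequences to a chosen length. It assumes that the "6" in DNA_bert_6 refers to 6-mer tokens with a stride of 1; the abstract does not describe the tokenization, and the function names `kmer_tokens` and `truncate` are hypothetical.

```python
def kmer_tokens(sequence: str, k: int = 6) -> list[str]:
    """Split a DNA sequence into overlapping k-mers (stride 1).

    Assumption: k-mer tokenization with k=6; not confirmed by the abstract.
    """
    seq = sequence.upper()
    if len(seq) < k:
        return []
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def truncate(sequence: str, max_length: int) -> str:
    """Keep only the first max_length bases, e.g. to compare input lengths."""
    return sequence[:max_length]


if __name__ == "__main__":
    example = "ATGCGTACGT"  # made-up sequence for illustration only
    print(kmer_tokens(example))
    print(kmer_tokens(truncate(example, 8)))
```

Varying `max_length` before tokenization is one simple way to probe how sequence length affects classification, although the study's actual procedure may differ.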
RESULTS: The genotype-phenotype datasets we constructed showed strong classification performance against random genomic sequences, particularly the HERV dataset, which achieved binary and multi-class classification accuracies and F1 values exceeding 0.935 and 0.888, respectively. Notably, fine-tuning on the HERV dataset not only improved our ability to identify and distinguish diverse information types within DNA sequences but also identified specific motifs associated with neurological disorders and cancers in regions with high ALRW scores. Subsequent analysis of these motifs shed light on the adaptive responses of species to environmental pressures and their co-evolution with pathogens.
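For readers unfamiliar with the reported metrics, the following is a minimal sketch of how accuracy and F1 can be computed from true and predicted labels. It assumes macro-averaged F1 for the multi-class case; the abstract does not state which averaging was used, and the labels and predictions shown are invented for illustration, not taken from the study.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def f1_for_label(y_true, y_pred, label):
    """F1 score treating `label` as the positive class."""
    tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
    fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
    fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 (assumed averaging scheme)."""
    labels = sorted(set(y_true) | set(y_pred))
    return sum(f1_for_label(y_true, y_pred, l) for l in labels) / len(labels)


if __name__ == "__main__":
    # Invented toy labels: HERV sequences versus random genomic sequences.
    y_true = ["HERV", "HERV", "random", "random"]
    y_pred = ["HERV", "random", "random", "random"]
    print(accuracy(y_true, y_pred), macro_f1(y_true, y_pred))
```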
CONCLUSIONS: These findings highlight the potential of pre-trained genomic models for learning DNA sequence representations, particularly with the HERV dataset, and provide valuable insights for future research. This study presents an innovative strategy that combines pre-trained genomic model representations with classical methods for analyzing the functionality of genome sequences, thereby promoting cross-fertilization between genomics and artificial intelligence.