Keywords: RNA splicing; language model; self-supervised learning

MeSH: Animals; Humans; Base Sequence; RNA Splicing; Vertebrates / genetics; RNA; Supervised Machine Learning

Source: DOI: 10.1093/bib/bbae163

Abstract:
Language models pretrained by self-supervised learning (SSL) have been widely utilized to study protein sequences, whereas few models have been developed for genomic sequences, and those have been limited to single species. Because they were not trained on genomes from multiple species, these models cannot effectively leverage evolutionary information. In this study, we developed SpliceBERT, a language model pretrained by masked language modeling on primary ribonucleic acid (RNA) sequences from 72 vertebrates, and applied it to sequence-based modeling of RNA splicing. Pretraining SpliceBERT on diverse species enables it to effectively identify evolutionarily conserved elements. Meanwhile, the learned hidden states and attention weights can characterize the biological properties of splice sites. As a result, SpliceBERT was shown to be effective on several downstream tasks: zero-shot prediction of variant effects on splicing, prediction of branchpoints in humans, and cross-species prediction of splice sites. Our study highlights the importance of pretraining genomic language models on a diverse range of species and suggests that SSL is a promising approach for enhancing our understanding of the regulatory logic underlying genomic sequences.
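The zero-shot variant-effect task described above lends itself to a short illustration. The sketch below shows one common way to score a single-nucleotide variant with a masked nucleotide language model: mask the variant position and compare the log-probabilities the model assigns to the reference and alternative bases. The checkpoint path, the space-separated tokenization, and the [CLS]-token offset are illustrative assumptions, not details confirmed by the abstract; consult the SpliceBERT repository for the released weights and tokenizer conventions.

```python
# A minimal sketch of zero-shot variant-effect scoring with a masked
# nucleotide language model, in the spirit of the approach described above.
# "path/to/splicebert" is a placeholder, not the actual released checkpoint.

import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("path/to/splicebert")
model = AutoModelForMaskedLM.from_pretrained("path/to/splicebert")
model.eval()

def zero_shot_variant_score(sequence: str, position: int, ref: str, alt: str) -> float:
    """Mask the variant site and return log P(alt) - log P(ref).
    A strongly negative score suggests the model finds the alternative
    base unlikely in this sequence context."""
    tokens = list(sequence.upper())
    assert tokens[position] == ref.upper(), "reference base mismatch"
    tokens[position] = tokenizer.mask_token  # hide the variant site
    # Space-separated single-nucleotide tokens are an assumption about
    # the tokenizer's expected input format.
    inputs = tokenizer(" ".join(tokens), return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits
    # +1 offset for the [CLS] token prepended by the tokenizer (assumed).
    log_probs = torch.log_softmax(logits[0, position + 1], dim=-1)
    ref_id = tokenizer.convert_tokens_to_ids(ref.upper())
    alt_id = tokenizer.convert_tokens_to_ids(alt.upper())
    return (log_probs[alt_id] - log_probs[ref_id]).item()

# Example: score a C>T change in a toy sequence around a putative splice site.
seq = "ACGGTAAGTCCTGCAGGTATCACGTAGGTAAC"
print(zero_shot_variant_score(seq, position=12, ref="C", alt="T"))
```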