Keywords: Homology search; MARS; RNA sequence database; RNAcmap3; Secondary structure

MeSH terms: Databases, Nucleic Acid; Sequence Alignment; RNA, Untranslated / genetics, chemistry; Sequence Analysis, RNA / methods; RNA / genetics, chemistry; Software; Databases, Genetic

Source: DOI:10.1093/gpbjnl/qzae018

Abstract:
The recent success of AlphaFold2 in protein structure prediction relied heavily on co-evolutionary information derived from homologous protein sequences found in a huge, integrated database of protein sequences (the Big Fantastic Database). In contrast, existing nucleotide databases have not been consolidated to facilitate wider and deeper homology search. Here, we built a comprehensive database by incorporating the non-coding RNA (ncRNA) sequences from RNAcentral, the transcriptome and metagenome assemblies from metagenomics RAST (MG-RAST), the genomic sequences from Genome Warehouse (GWH), and the genomic sequences from MGnify, in addition to the nucleotide (nt) database and its subsets from the National Center for Biotechnology Information (NCBI). The resulting Master database of All possible RNA sequences (MARS) is 20-fold larger than NCBI's nt database and 60-fold larger than RNAcentral. The new dataset, along with a new split-search strategy, allows a substantial improvement in homology search over existing state-of-the-art techniques. It also yields more accurate and more sensitive multiple sequence alignments (MSAs) than manually curated MSAs from Rfam for the majority of structured RNAs mapped to Rfam. The results indicate that MARS coupled with the fully automatic homology search tool RNAcmap will be useful for improved structural and functional inference of ncRNAs and for RNA language models based on MSAs. MARS is accessible at https://ngdc.cncb.ac.cn/omix/release/OMIX003037, and RNAcmap3 is accessible at http://zhouyq-lab.szbl.ac.cn/download/.