Keywords: ORFan; environmental sequencing; shotgun metagenomics; viral ecology

Source: DOI:10.4056/sigs.2945050

Abstract:
One consistent finding among studies that use shotgun metagenomics to analyze whole viral communities is that most viral sequences show no significant homology to known sequences. Thus, bioinformatic analyses based on sequence collections such as GenBank nr, which are composed largely of sequences from known organisms, tend to ignore the majority of sequences within most shotgun viral metagenome libraries. Here we describe a bioinformatic pipeline, the Viral Informatics Resource for Metagenome Exploration (VIROME), that emphasizes the classification of viral metagenome sequences (predicted open reading frames, ORFs) based on homology search results against both known and environmental sequences. Functional and taxonomic information is derived from five annotated sequence databases that are linked to the UniRef 100 database. Environmental classifications are obtained from hits against a custom database, MetaGenomes On-Line, which contains 49 million predicted environmental peptides. Each predicted viral metagenomic ORF run through the VIROME pipeline is placed into one of seven ORF classes, so every sequence receives a meaningful annotation. Additionally, the pipeline includes quality control measures to remove contaminating and poor-quality sequences, and it assesses the potential amount of cellular DNA contamination in a viral metagenome library by screening for rRNA genes. Access to the VIROME pipeline and its analysis results is provided through a web application interface that is dynamically linked to a relational back-end database. The VIROME web application interface is designed to give users flexibility in retrieving sequences (reads, ORFs, predicted peptides) and search results for focused secondary analyses.