Sequence quality

  • Article type: Journal Article
    This chapter describes procedures for the use of DNA sequence data to obtain and compare taxonomic identification using the public databases GenBank and Barcode of Life Data System (BOLD). The chapter begins by describing procedures used to prepare quality sequences for uploading into GenBank and BOLD. Next, steps used to query the DNA sequences against the public databases are described using GenBank BLAST and BOLD identification engines. Interpretation guidelines for the taxonomic identification assignments are presented. Finally, a procedure for evaluating the accuracy and reliability of sequences from GenBank and BOLD is provided.

  • Article type: Journal Article
    BACKGROUND: High throughput DNA/RNA sequencing has revolutionized biological and clinical research. Sequencing is widely used, and generates very large amounts of data, mainly due to reduced cost and advanced technologies. Quickly assessing the quality of giga-to-tera base levels of sequencing data has become a routine but important task. Identification and elimination of low-quality sequence data is crucial for reliability of downstream analysis results. There is a need for a high-speed tool that uses optimized parallel programming for batch processing and simply gauges the quality of sequencing data from multiple datasets independent of any other processing steps.
    RESULTS: FQStat is a stand-alone, platform-independent software tool that assesses the quality of FASTQ files using parallel programming. Based on the machine architecture and input data, FQStat automatically determines the number of cores and the amount of memory to be allocated per file for optimum performance. Our results indicate that in a core-limited case, core assignment overhead exceeds the benefit of additional cores. In a core-unlimited case, there is a saturation point reached in performance by increasingly assigning additional cores per file. We also show that memory allocation per file has a lower priority in performance when compared to the allocation of cores. FQStat's output is summarized in HTML web page, tab-delimited text file, and high-resolution image formats. FQStat calculates and plots read count, read length, quality score, and high-quality base statistics. FQStat identifies and marks low-quality sequencing data to suggest removal from downstream analysis. We applied FQStat on real sequencing data to optimize performance and to demonstrate its capabilities. We also compared FQStat's performance to similar quality control (QC) tools that utilize parallel programming and attained improvements in run time.
    CONCLUSIONS: FQStat is a user-friendly tool with a graphical interface that employs a parallel programming architecture and automatically optimizes its performance to generate quality control statistics for sequencing data. Unlike existing tools, these statistics are calculated for multiple datasets and separately at the "lane," "sample," and "experiment" level to identify subsets of the samples with low quality, thereby preventing the loss of complete samples when reliable data can still be obtained.
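The kind of per-file statistics named in this abstract (read count, read length, quality score, high-quality bases) can be illustrated with a minimal sketch. This is not FQStat's code or interface. The function name `fastq_stats`, the Phred+33 offset, the Q30 cutoff for a "high-quality" base, and the strict four-lines-per-record layout are all assumptions made for illustration.

```python
def fastq_stats(lines, min_q=30, offset=33):
    """Summarize FASTQ records supplied as an iterable of text lines.

    Assumptions (not taken from FQStat): four lines per record,
    Phred+33 quality encoding, and min_q as the high-quality cutoff.
    """
    read_count = 0
    total_bases = 0
    quality_sum = 0
    high_quality_bases = 0

    it = iter(lines)
    for header in it:
        header = header.strip()
        if not header:
            continue  # tolerate blank lines between records
        seq = next(it).strip()
        next(it)  # the '+' separator line is not needed here
        qual = next(it).strip()
        if not header.startswith("@") or len(seq) != len(qual):
            raise ValueError(f"malformed FASTQ record: {header!r}")

        # Decode each quality character into a Phred score.
        scores = [ord(ch) - offset for ch in qual]
        read_count += 1
        total_bases += len(seq)
        quality_sum += sum(scores)
        high_quality_bases += sum(1 for q in scores if q >= min_q)

    return {
        "read_count": read_count,
        "mean_read_length": total_bases / read_count if read_count else 0.0,
        "mean_quality": quality_sum / total_bases if total_bases else 0.0,
        "high_quality_fraction": (
            high_quality_bases / total_bases if total_bases else 0.0
        ),
    }
```

For a file on disk, the function can be called as `fastq_stats(open(path))`. Processing several files at once, as the abstract describes, could be approximated by mapping this function over file paths with `concurrent.futures.ProcessPoolExecutor`, although choosing the number of workers per file is exactly the tuning problem the authors study.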

  • Article type: Journal Article
    SKESA is a de Bruijn graph-based de novo assembler designed for assembling reads of microbial genomes sequenced using Illumina. Comparison with SPAdes and MegaHit shows that SKESA produces assemblies that have high sequence quality and contiguity, handles low-level contamination in reads, is fast, and produces an identical assembly for the same input when assembled multiple times with the same or different compute resources. SKESA has been used for assembling over 272,000 read sets in the Sequence Read Archive at NCBI and for real-time pathogen detection. Source code for SKESA is freely available at https://github.com/ncbi/SKESA/releases.
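For readers unfamiliar with the data structure this abstract mentions, the following is a toy sketch of a de Bruijn graph built from reads. It is not SKESA's algorithm. The function names `build_de_bruijn` and `walk_unambiguous` are hypothetical, and the walk is deliberately simplified: it only checks that a node has a single successor and ignores incoming edges, sequencing errors, and reverse complements.

```python
from collections import defaultdict


def build_de_bruijn(reads, k):
    """Map each (k-1)-mer to the set of (k-1)-mers that follow it.

    Every k-mer in a read contributes one edge from its prefix
    (first k-1 bases) to its suffix (last k-1 bases).
    """
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])
    return graph


def walk_unambiguous(graph, start):
    """Extend a sequence from start while the next node is unique.

    Simplification: only out-degree is checked, so this is an
    illustration of path extension, not a full unitig construction.
    """
    contig = start
    node = start
    seen = {start}
    # graph.get avoids creating empty entries in the defaultdict.
    while len(graph.get(node, ())) == 1:
        (successor,) = graph[node]
        if successor in seen:
            break  # stop on a cycle instead of looping forever
        contig += successor[-1]
        seen.add(successor)
        node = successor
    return contig
```

With the two overlapping reads `"ATGGCG"` and `"GGCGTA"` and `k=4`, the graph links `ATG` through to `GTA`, and `walk_unambiguous(graph, "ATG")` reconstructs the merged sequence `ATGGCGTA`.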

  • Article type: Journal Article
    Approaches in molecular biology, particularly those that deal with high-throughput sequencing of entire microbial communities (the field of metagenomics), are rapidly advancing our understanding of the composition and functional content of microbial communities involved in climate change, environmental pollution, human health, biotechnology, etc. Metagenomics provides researchers with the most complete picture of the taxonomic (i.e., what organisms are there) and functional (i.e., what are those organisms doing) composition of natively sampled microbial communities, making it possible to perform investigations that include organisms that were previously intractable to laboratory-controlled culturing; currently, these constitute the vast majority of all microbes on the planet. All organisms contained in environmental samples are sequenced in a culture-independent manner, most often with 16S ribosomal amplicon methods to investigate the taxonomic content, or with whole-genome shotgun-based methods to investigate the functional content, of sampled communities. Metagenomics allows researchers to characterize the community composition and functional content of microbial communities, but it cannot show which functional processes are active; however, near parallel developments in transcriptomics promise a dramatic increase in our knowledge in this area as well. Since 2008, MG-RAST (Meyer et al., BMC Bioinformatics 9:386, 2008) has served as a public resource for annotation and analysis of metagenomic sequence data, providing a repository that currently houses more than 150,000 data sets (containing 60+ tera-base-pairs) with more than 23,000 publicly available. MG-RAST, or the metagenomics RAST (rapid annotation using subsystems technology) server, makes it possible for users to upload raw metagenomic sequence data in (preferably) fastq or fasta format. Assessments of sequence quality and annotation with respect to multiple reference databases are performed automatically with minimal input from the user (see Subheading 4 at the end of this chapter for more details). Post-annotation analysis and visualization are also possible, directly through the web interface, or with tools like matR (metagenomic analysis tools for R, covered later in this chapter) that utilize the MG-RAST API (http://api.metagenomics.anl.gov/api.html) to easily download data from any stage in the MG-RAST processing pipeline. Over the years, MG-RAST has undergone substantial revisions to keep pace with the dramatic growth in the number, size, and types of sequence data that accompany constantly evolving developments in metagenomics and related -omic sciences (e.g., metatranscriptomics).