Keywords: QC; algorithm; bioinformatics; metagenomics; microbiome; pipeline; quality control; sequencing; short read

Source: DOI:10.1128/mSystems.00202-17

Abstract:
Next-generation sequencing technology is of great importance for many biological disciplines; however, due to technical and biological limitations, the short DNA sequences produced by modern sequencers require numerous quality control (QC) measures to reduce errors, remove technical contaminants, or merge paired-end reads into longer or higher-quality contigs. Many tools exist for each step, but choosing the appropriate methods and usage parameters can be challenging because the parameterization of each step depends on the particularities of the sequencing technology used, the type of samples being analyzed, and the stochasticity of the instrumentation and sample preparation. Furthermore, end users may not know all of the information about how their data were generated that they need to make informed choices, such as the expected overlap for paired-end sequences or the type of adaptors used. This increasing complexity and nuance demand a pipeline that combines existing steps in a user-friendly way and, when possible, learns reasonable quality parameters from the data automatically. We propose a user-friendly quality control pipeline called SHI7 (canonically pronounced "shizen"), which aims to simplify quality control of short-read data for the end user by predicting the presence and/or type of common sequencing adaptors, what quality scores to trim, whether the data set is shotgun or amplicon sequencing, whether reads are paired end or single end, and whether pairs are stitchable, including the expected amount of pair overlap. We hope that SHI7 will make it easier for all researchers, expert and novice alike, to follow reasonable practices for short-read data quality control.

IMPORTANCE: Quality control of high-throughput DNA sequencing data is an important but sometimes laborious task requiring background knowledge of the sequencing protocol used (such as adaptor type, sequencing technology, insert size/stitchability, and paired-endedness). Quality control protocols typically require applying this background knowledge to select and execute numerous quality control steps with the appropriate parameters, which is especially difficult when working with public data or data from collaborators who use different protocols. We have created a streamlined quality control pipeline intended to substantially simplify the process of DNA quality control from raw machine output files to actionable sequence data. In contrast to other methods, our proposed pipeline is easy to install and use and attempts to learn the necessary parameters from the data automatically with a single command.
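To make the idea of learning QC parameters from the data more concrete, the following is a minimal Python sketch of two of the steps named in the abstract: guessing which adaptor is present from a sample of reads, and trimming low-quality bases from the 3' end of a read. It is not SHI7's actual implementation. The function names (`guess_adaptor`, `trim_tail`), the 5% detection threshold, the Q20 cutoff, and the short adaptor table are all assumptions chosen for illustration, and quality strings are assumed to use the Phred+33 encoding.

```python
from collections import Counter

# Illustrative adaptor table (assumption): short prefixes of two widely
# used Illumina adaptor sequences. A real pipeline would use a fuller set.
ADAPTORS = {
    "TruSeq": "AGATCGGAAGAGC",
    "Nextera": "CTGTCTCTTATACACATCT",
}


def guess_adaptor(reads, min_fraction=0.05):
    """Return the name of the adaptor seen most often in `reads`, or None.

    `reads` is a list of sequence strings sampled from the data set.
    `min_fraction` is a hypothetical threshold: the adaptor must occur in
    at least this fraction of sampled reads to be reported.
    """
    if not reads:
        return None
    hits = Counter()
    for seq in reads:
        for name, adaptor in ADAPTORS.items():
            if adaptor in seq:
                hits[name] += 1
    if not hits:
        return None
    name, count = hits.most_common(1)[0]
    return name if count / len(reads) >= min_fraction else None


def trim_tail(seq, quals, min_q=20, offset=33):
    """Trim trailing bases whose Phred quality is below `min_q`.

    `quals` is the FASTQ quality string for `seq`; `offset` is the ASCII
    offset of the encoding (33 for Phred+33, assumed here).
    """
    end = len(seq)
    while end > 0 and ord(quals[end - 1]) - offset < min_q:
        end -= 1
    return seq[:end], quals[:end]


if __name__ == "__main__":
    sample = ["ACGTACGTAGATCGGAAGAGCTT", "TTTTAGATCGGAAGAGC", "ACGTACGT", "GGGGCCCC"]
    print(guess_adaptor(sample))
    print(trim_tail("ACGTACGT", "IIIII###"))
```

Exact substring matching keeps the sketch short; real adaptor detection generally needs to tolerate sequencing errors and partial adaptor matches at read ends.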