PCR duplicates

  • Article type: Journal Article
    BACKGROUND: Parameters adversely affecting the contiguity and accuracy of the assemblies from Illumina next-generation sequencing (NGS) are well described. However, past studies generally focused on their additive effects, overlooking potential interactions that may exacerbate one another's effects in a multiplicative manner. To investigate whether they act interactively on de novo genome assembly quality, we simulated sequencing data for 13 bacterial reference genomes, with varying levels of error rate, sequencing depth, and PCR and optical duplicate ratios.
    RESULTS: We assessed the quality of assemblies from the simulated sequencing data with a number of contiguity and accuracy metrics, which we used to quantify both additive and multiplicative effects of the four parameters. We found that the tested parameters are engaged in complex interactions, exerting multiplicative, rather than additive, effects on assembly quality. Also, the ratio of non-repeated regions and GC% of the original genomes can shape how the four parameters affect assembly quality.
    CONCLUSIONS: We provide a framework for consideration in future studies using de novo genome assembly of bacterial genomes, e.g. in choosing the optimal sequencing depth, balancing between its positive effect on contiguity and negative effect on accuracy due to its interaction with error rate. Furthermore, the properties of the genomes to be sequenced also should be taken into account, as they might influence the effects of error sources themselves.

  • Article type: Journal Article
    Library preparation protocols for most sequencing technologies involve PCR amplification of the template DNA, which opens the possibility that a given template DNA molecule is sequenced multiple times. Reads arising from this phenomenon, known as PCR duplicates, inflate the cost of sequencing and can jeopardize the reliability of affected experiments. Despite the pervasiveness of this artefact, our understanding of its causes and of its impact on downstream statistical analyses remains essentially empirical. Here, we develop a general quantitative model of amplification distortions in sequencing data sets, which we leverage to investigate the factors controlling the occurrence of PCR duplicates. We show that the PCR duplicate rate is determined primarily by the ratio between library complexity and sequencing depth, and that amplification noise (including its dependence on the number of PCR cycles) plays only a secondary role for this artefact. We confirm our predictions using new and published RAD-seq libraries and provide a method to estimate library complexity and amplification noise in any data set containing PCR duplicates. We discuss how amplification-related artefacts impact downstream analyses, and in particular genotyping accuracy. The proposed framework unites the numerous observations made on PCR duplicates and will be useful to experimenters of all sequencing technologies where DNA availability is a concern.
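
The dependence of the duplicate rate on the ratio between library complexity and sequencing depth can be illustrated with a deliberately simplified sketch. It assumes that every template molecule is amplified equally and that reads are drawn uniformly with replacement from the library, so it ignores the amplification noise that the abstract describes as secondary. This is a textbook sampling approximation, not the authors' quantitative model, and the function name `expected_duplicate_fraction` is hypothetical.

```python
import math


def expected_duplicate_fraction(complexity: float, reads: float) -> float:
    """Expected fraction of reads that are duplicates.

    Assumption (not from the paper): `reads` reads are sampled uniformly
    with replacement from `complexity` distinct template molecules, so the
    number of reads per molecule is approximately Poisson distributed.
    """
    if complexity <= 0 or reads <= 0:
        raise ValueError("complexity and reads must be positive")
    depth_ratio = reads / complexity
    # Expected number of distinct molecules observed at least once:
    # complexity * (1 - exp(-reads / complexity)).
    expected_unique = complexity * -math.expm1(-depth_ratio)
    return 1.0 - expected_unique / reads


if __name__ == "__main__":
    # Same library, increasing sequencing depth.
    for n_reads in (1e5, 1e6, 1e7):
        rate = expected_duplicate_fraction(complexity=1e6, reads=n_reads)
        print(f"reads={n_reads:.0e}  duplicate fraction={rate:.3f}")
```

Under these assumptions the result depends only on `reads / complexity`: doubling both the library complexity and the number of reads leaves the expected duplicate fraction unchanged.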

  • Article type: Journal Article
    Next Generation Sequencing (NGS) has dramatically improved the flexibility and outcomes of cancer research and clinical trials, providing highly sensitive and accurate high-throughput platforms for large-scale genomic testing. In contrast to whole-genome (WGS) or whole-exome sequencing (WES), targeted genomic sequencing (TS) focuses on a panel of genes or targets known to have strong associations with pathogenesis of disease and/or clinical relevance, offering greater sequencing depth with reduced costs and data burden. This allows targeted sequencing to identify low-frequency variants in targeted regions with high confidence, making it suitable for profiling low-quality and fragmented clinical DNA samples. As a result, TS has been widely used in clinical research and trials for patient stratification and the development of targeted therapeutics. However, its transition to routine clinical use has been slow. Many technical and analytical obstacles remain and need to be discussed and addressed before large-scale and cross-centre implementation. Gold-standard and state-of-the-art procedures and pipelines are urgently needed to accelerate this transition. In this review we first present how TS is conducted in cancer research, including various target enrichment platforms, the construction of target panels, and selected research and clinical studies utilising TS to profile clinical samples. We then present a generalised analytical workflow for TS data, discussing important parameters and filters in detail, aiming to provide best practices for TS usage and analysis.

  • Article type: Journal Article
    BACKGROUND: RNA-seq and small RNA-seq are powerful, quantitative tools to study gene regulation and function. Common high-throughput sequencing methods rely on polymerase chain reaction (PCR) to expand the starting material, but not every molecule amplifies equally, causing some to be overrepresented. Unique molecular identifiers (UMIs) can be used to distinguish undesirable PCR duplicates derived from a single molecule and identical but biologically meaningful reads from different molecules.
    RESULTS: We have incorporated UMIs into RNA-seq and small RNA-seq protocols and developed tools to analyze the resulting data. Our UMIs contain stretches of random nucleotides whose lengths sufficiently capture diverse molecule species in both RNA-seq and small RNA-seq libraries generated from mouse testis. Our approach yields high-quality data while allowing unique tagging of all molecules in high-depth libraries.
    CONCLUSIONS: Using simulated and real datasets, we demonstrate that our methods increase the reproducibility of RNA-seq and small RNA-seq data. Notably, we find that the amount of starting material and sequencing depth, but not the number of PCR cycles, determine PCR duplicate frequency. Finally, we show that computational removal of PCR duplicates based only on their mapping coordinates introduces substantial bias into data analysis.
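
The contrast between coordinate-only and UMI-aware duplicate removal can be sketched as follows. This is a minimal illustration, not the tools developed by the authors: the `Read` record and both function names are hypothetical, reads are assumed to be already aligned, and UMI sequencing errors are ignored (exact UMI matches only).

```python
from collections import namedtuple

# Hypothetical minimal read record: mapping coordinates plus the UMI sequence.
Read = namedtuple("Read", ["chrom", "pos", "strand", "umi"])


def count_unique_by_coordinate(reads):
    """Count reads kept when duplicates are defined by mapping coordinates only."""
    return len({(r.chrom, r.pos, r.strand) for r in reads})


def count_unique_by_coordinate_and_umi(reads):
    """Count reads kept when duplicates must share coordinates AND the UMI."""
    return len({(r.chrom, r.pos, r.strand, r.umi) for r in reads})


if __name__ == "__main__":
    reads = [
        Read("chr1", 1000, "+", "AACGT"),  # molecule 1
        Read("chr1", 1000, "+", "AACGT"),  # PCR copy of molecule 1
        Read("chr1", 1000, "+", "GTTCA"),  # molecule 2, same coordinates
        Read("chr2", 500, "-", "AACGT"),   # different locus, same UMI by chance
    ]
    print("coordinate only:", count_unique_by_coordinate(reads))
    print("coordinate + UMI:", count_unique_by_coordinate_and_umi(reads))
```

In this toy input the coordinate-only rule collapses two distinct molecules at `chr1:1000` into one, which is the kind of undercounting that can bias quantification for highly expressed or short RNAs.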

  • Article type: Comparative Study
    The trade-offs of using single-digest vs. double-digest restriction site-associated DNA sequencing (RAD-seq) protocols have been widely discussed. However, no direct empirical comparisons of the two methods have been conducted. Here, we sampled a single population of Gulf pipefish (Syngnathus scovelli) and genotyped 444 individuals using RAD-seq. Sixty individuals were subjected to single-digest RAD-seq (sdRAD-seq), and the remaining 384 individuals were genotyped using a double-digest RAD-seq (ddRAD-seq) protocol. We analysed the resulting Illumina sequencing data and compared the two genotyping methods when reads were analysed either together or separately. Coverage statistics, observed heterozygosity, and allele frequencies differed significantly between the two protocols, as did the results of selection components analysis. We also performed an in silico digestion of the Gulf pipefish genome and modelled five major sources of bias: PCR duplicates, polymorphic restriction sites, shearing bias, asymmetric sampling (i.e., genotyping fewer individuals with sdRAD-seq than with ddRAD-seq) and higher major allele frequencies. This combination of approaches allowed us to determine that polymorphic restriction sites, an asymmetric sampling scheme, mean allele frequencies and to some extent PCR duplicates all contribute to different estimates of allele frequencies between samples genotyped using sdRAD-seq versus ddRAD-seq. Our finding that sdRAD-seq and ddRAD-seq can result in different allele frequencies has implications for comparisons across studies and techniques that endeavour to identify genomewide signatures of evolutionary processes in natural populations.

  • Article type: Journal Article
    BACKGROUND: PCR amplification is an important step in the preparation of DNA sequencing libraries prior to high-throughput sequencing. PCR amplification introduces redundant reads in the sequence data, and estimating the PCR duplication rate is important to assess the frequency of such reads. Existing computational methods do not distinguish PCR duplicates from "natural" read duplicates that represent independent DNA fragments and therefore overestimate the PCR duplication rate for DNA-seq and RNA-seq experiments.
    RESULTS: In this paper, we present a computational method to estimate the average PCR duplication rate of high-throughput sequence datasets that accounts for natural read duplicates by leveraging heterozygous variants in an individual genome. Analysis of simulated data and exome sequence data from the 1000 Genomes project demonstrated that our method can accurately estimate the PCR duplication rate on paired-end as well as single-end read datasets which contain a high proportion of natural read duplicates. Further, analysis of exome datasets prepared using the Nextera library preparation method indicated that 45-50% of read duplicates correspond to natural read duplicates likely due to fragmentation bias. Finally, analysis of RNA-seq datasets from individuals in the 1000 Genomes project demonstrated that 70-95% of read duplicates observed in such datasets correspond to natural duplicates sampled from genes with high expression and identified outlier samples with a 2-fold greater PCR duplication rate than other samples.
    CONCLUSIONS: The method described here is a useful tool for estimating the PCR duplication rate of high-throughput sequence datasets and for assessing the fraction of read duplicates that correspond to natural read duplicates. An implementation of the method is available at https://github.com/vibansal/PCRduplicates .
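
The intuition behind using heterozygous variants can be sketched with a simplified pairwise model. It is not the implementation available at the URL above. The assumptions are: each observation is a pair of reads with identical mapping coordinates that both cover a heterozygous site; PCR duplicates of one molecule always show the same allele; independent (natural) duplicates show different alleles with probability 0.5; sequencing errors and allelic imbalance are ignored. The function name `estimate_natural_duplicate_fraction` is hypothetical.

```python
def estimate_natural_duplicate_fraction(allele_pairs):
    """Estimate the fraction of duplicate pairs that are natural duplicates.

    `allele_pairs` is an iterable of (allele_a, allele_b) tuples, one per
    duplicate read pair overlapping a heterozygous site.

    Under the stated assumptions, P(discordant alleles) = 0.5 * f_natural,
    so f_natural is estimated as 2 * (discordant pairs / all pairs).
    """
    pairs = list(allele_pairs)
    if not pairs:
        raise ValueError("at least one duplicate pair is required")
    discordant = sum(1 for a, b in pairs if a != b)
    # Clip to 1.0 because sampling noise can push the estimate above 1.
    return min(1.0, 2.0 * discordant / len(pairs))


if __name__ == "__main__":
    # Toy data: 10 duplicate pairs at heterozygous A/G sites, 2 discordant.
    pairs = [("A", "A")] * 6 + [("G", "G")] * 2 + [("A", "G")] * 2
    natural = estimate_natural_duplicate_fraction(pairs)
    print(f"natural fraction ~ {natural:.2f}, PCR fraction ~ {1 - natural:.2f}")
```

The key point is that concordant pairs alone cannot be classified, but the discordant pairs can only come from independent fragments, which is what makes the overall natural fraction estimable.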

  • Article type: Journal Article
    The identification of thousands of variants across the genomes and their accurate genotyping are crucial for estimating the genetic parameters needed to address a host of molecular ecological and evolutionary questions. With rapid advances of massively parallel high-throughput sequencing technologies, several methods have recently been developed to access genomewide data on population variation. One of the most successful and widely used techniques relies on the combination of restriction enzymes and sequencing-by-synthesis: restriction-site-associated DNA sequencing (RADSeq). We developed a new, more time- and cost-efficient double-digest RAD paired-end protocol (quaddRAD) that simplifies and speeds up the identification of PCR duplicates and permits large-scale multiplexing. Assessing its performance on a technical data set, we also applied the quaddRAD method on population samples of a Neotropical cichlid fish lineage (Archocentrus centrarchus) to assess its genetic structure and demographic history. While we identified allopatric interlake genetic divergence, most likely driven by drift, no signature of sympatric divergence was detected. This differs from what has been observed in the clade of Midas cichlids (Amphilophus citrinellus spp.), another cichlid lineage that inhabits the same lakes and shares a similar demographic history, but has evolved into small-scale adaptive radiations via sympatric speciation. We demonstrate that quaddRAD is a robust and efficient method for genotyping a massive number and widely overlapping set of loci with high accuracy. Furthermore, the results on A. centrarchus open new research avenues providing an ideal system to investigate genome-level mechanisms that could alter the speciation potential of different but closely related cichlid lineages.

  • Article type: Journal Article
    Tag-Seq is a high-throughput approach used for discovering SNPs and characterizing gene expression. In comparison to RNA-Seq, Tag-Seq eases data processing and allows detection of rare mRNA species using only one tag per transcript molecule. However, reduced library complexity raises the issue of PCR duplicates, which distort gene expression levels. Here we present a novel Tag-Seq protocol that uses the least biased methods for RNA library preparation combined with a novel approach for joint PCR template and sample labeling. In our protocol, input RNA is fragmented by hydrolysis, and poly(A)-bearing RNAs are selected and directly ligated to mixed DNA-RNA P5 adapters. The P5 adapters contain i5 barcodes composed of sample-specific (moderately) degenerate base regions (mDBRs), which later allow detection of PCR duplicates. The P7 adapter is attached via reverse transcription with individual i7 barcodes added during the amplification step. The resulting libraries can be sequenced on an Illumina sequencer. After sample demultiplexing and PCR duplicate removal with a free software tool we designed, the data are ready for downstream analysis. Our protocol was tested on RNA samples from predator-induced and control Daphnia microcrustaceans.

  • Article type: Journal Article
    Double-digested RADseq (ddRADseq) is a NGS methodology that generates reads from thousands of loci targeted by restriction enzyme cut sites, across multiple individuals. To be statistically sound and economically optimal, a ddRADseq experiment has a preliminary design stage that needs to consider issues related to the selection of enzymes, particular features of the genome of the focal species, possible modifications to the library construction protocol, coverage needed to minimize missing data, and the potential sources of error that may impact upon the coverage. We present ddradseqtools, a software package to help ddRADseq experimental design by (i) the generation of in silico double-digested fragments; (ii) the construction of modified ddRADseq libraries using adapters with either one or two indexes and degenerate base regions (DBRs) to quantify PCR duplicates; and (iii) the initial steps of the bioinformatics preprocessing of reads. ddradseqtools generates single-end (SE) or paired-end (PE) reads that may bear SNPs and/or indels. The effect of allele dropout and PCR duplicates on coverage is also simulated. The resulting output files can be submitted to pipelines of alignment and variant calling, to allow the fine-tuning of parameters. The software was validated with specific tests for the correct operability of the program. The correspondence between in silico settings and parameters from ddRADseq in vitro experiments was assessed to provide guidelines for the reliable performance of the software. ddradseqtools is cost-efficient in terms of execution time, and can be run on computers with standard CPU and RAM configuration.

  • Article type: Journal Article
    Puritz et al. provide a review of several RADseq methodological approaches in response to our 'Population Genomic Data Analysis' workshop (Sept 2013) review (Andrews & Luikart 2014). We agree with Puritz et al. on the importance for researchers to thoroughly understand RADseq library preparation and data analysis when choosing an approach for answering their research questions. Some of us are currently using multiple RADseq protocols, and we agree that the different methods may offer advantages in different cases. Our workshop review did not intend to provide a thorough review of RADseq because the workshop covered a broad range of topics within the field of population genomics. Similarly, neither the response of Puritz et al. nor our comments here provide sufficient space to thoroughly review RADseq. Nonetheless, here we address some key points that we find unclear or potentially misleading in their evaluation of techniques.