sequencing depth

  • Article type: Journal Article
    Whole genome sequencing for generating SNP data is increasingly used in population genetic studies. However, obtaining genomes for massive numbers of samples is still not within the budgets of many researchers. It is thus imperative to select an appropriate reference genome and sequencing depth to ensure the accuracy of the results for a specific research question, while balancing cost and feasibility. To evaluate the effect of the choice of the reference genome and sequencing depth on downstream analyses, we used five confamilial reference genomes of variable relatedness and three levels of sequencing depth (3.5×, 7.5× and 12×) in a population genomic study on two caddisfly species: Himalopsyche digitata and H. tibetana. Using these 30 datasets (five reference genomes × three depths × two target species), we estimated population genetic indices (inbreeding coefficient, nucleotide diversity, pairwise F_ST, and genome-wide distribution of F_ST) based on variants and population structure (PCA and admixture) based on genotype likelihood estimates. The results showed that both distantly related reference genomes and lower sequencing depth lead to degradation of resolution. In addition, choosing a more closely related reference genome may significantly remedy the defects caused by low depth. Therefore, we conclude that population genetic studies would benefit from closely related reference genomes, especially as the costs of obtaining a high-quality reference genome continue to decrease. However, to determine a cost-efficient strategy for a specific population genomic study, a trade-off between reference genome relatedness and sequencing depth can be considered.

  • Article type: Journal Article
    High-throughput experiments are an essential part of modern biological and biomedical research. The outcomes of high-throughput biological experiments often have many missing observations due to signals below detection levels. For example, most single-cell RNA-seq (scRNA-seq) protocols experience high levels of dropout due to the small amount of starting material, leading to a majority of reported expression levels being zero. Though missing data contain information about reproducibility, they are often excluded from the reproducibility assessment, potentially generating misleading assessments. In this article, we develop a regression model to assess how the reproducibility of high-throughput experiments is affected by the choices of operational factors (e.g., platform or sequencing depth) when a large number of measurements are missing. Using a latent variable approach, we extend correspondence curve regression, a recently proposed method for assessing the effects of operational factors on reproducibility, to incorporate missing values. Using simulations, we show that our method is more accurate in detecting differences in reproducibility than existing measures of reproducibility. We illustrate the usefulness of our method using a single-cell RNA-seq dataset collected on HCT116 cells. We compare the reproducibility of different library preparation platforms and study the effect of sequencing depth on reproducibility, thereby determining the cost-effective sequencing depth that is required to achieve sufficient reproducibility.

  • Article type: Journal Article
    With the advent of next-generation sequencing, investigators have access to higher quality sequencing data. However, sequencing all samples in a study using next-generation sequencing can still be prohibitively expensive. One potential remedy could be to combine next-generation sequencing data from cases with publicly available sequencing data for controls, but there could be a systematic difference in the quality of sequenced data, such as sequencing depths, between sequenced study cases and publicly available controls. We propose a regression calibration (RC)-based method and a maximum-likelihood method for conducting an association study with such a combined sample by accounting for differential sequencing errors between cases and controls. The methods allow for adjusting for covariates, such as population stratification, as confounders. Both methods control type I error and have comparable power to analysis conducted using the true genotype with sufficiently high but different sequencing depths. We show that the RC method allows for analysis using a naive variance estimate (which closely approximates the true variance in practice) and standard software under certain circumstances. We evaluate the performance of the proposed methods using simulation studies and apply our methods to a combined data set of exome-sequenced acute lung injury cases and healthy controls from the 1000 Genomes project.