Variant calling

  • Article type: Journal Article
    BACKGROUND: At a global scale, the SARS-CoV-2 virus did not remain in its initial genotype for a long period of time, with the first global reports of variants of concern (VOCs) in late 2020. Subsequently, genome sequencing has become an indispensable tool for characterizing the ongoing pandemic, particularly for typing SARS-CoV-2 samples obtained from patients or environmental surveillance. For such SARS-CoV-2 typing, various in vitro and in silico workflows exist, yet to date, no systematic cross-platform validation has been reported.
    RESULTS: In this work, we present the first comprehensive cross-platform evaluation and validation of in silico SARS-CoV-2 typing workflows. The evaluation relies on a dataset of 54 patient-derived samples sequenced with several different in vitro approaches on all relevant state-of-the-art sequencing platforms. Moreover, we present UnCoVar, a robust, production-grade reproducible SARS-CoV-2 typing workflow that outperforms all other tested approaches in terms of precision and recall.
    CONCLUSIONS: In many ways, the SARS-CoV-2 pandemic has accelerated the development of techniques and analytical approaches. We believe that this can serve as a blueprint for dealing with future pandemics. Accordingly, UnCoVar is easily generalizable towards other viral pathogens and future pandemics. The fully automated workflow assembles virus genomes from patient samples, identifies existing lineages, and provides high-resolution insights into individual mutations. UnCoVar includes extensive quality control and automatically generates interactive visual reports. UnCoVar is implemented as a Snakemake workflow. The open-source code is available under a BSD 2-clause license at github.com/IKIM-Essen/uncovar.
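    The precision and recall used to compare typing workflows above can be illustrated with a minimal sketch. It assumes variants are compared as exact (contig, position, ref, alt) tuples against a truth set; the function name `precision_recall` and the toy variants are hypothetical and are not taken from UnCoVar or its evaluation data.

```python
def precision_recall(called, truth):
    """Return (precision, recall) for called variants against a truth set.

    Variants are compared as exact (contig, pos, ref, alt) tuples, which is
    an assumption of this sketch; real benchmarks usually normalize
    representations first.
    """
    called, truth = set(called), set(truth)
    true_positives = len(called & truth)
    precision = true_positives / len(called) if called else 0.0
    recall = true_positives / len(truth) if truth else 0.0
    return precision, recall


# Made-up toy data: two calls agree with the truth set, one does not.
truth = {("ref", 100, "A", "G"), ("ref", 250, "C", "T"), ("ref", 400, "G", "A")}
called = {("ref", 100, "A", "G"), ("ref", 250, "C", "T"), ("ref", 999, "C", "T")}

precision, recall = precision_recall(called, truth)
print(f"precision={precision:.3f} recall={recall:.3f}")
```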

  • Article type: Journal Article
    The integration of target capture systems with next-generation sequencing has emerged as an efficient tool for exploring specific genetic regions at high resolution and facilitating the rapid discovery of novel alleles. Despite these advancements, the application of targeted sequencing methodologies, such as the myBaits technology, in polyploid oat species remains relatively unexplored. In this study, we utilized the myBaits target capture method offered by Daicel Arbor Biosciences to detect variants and assess its reliability for variant detection in oat genomics and breeding. Ten oat genotypes were carefully chosen for targeted sequencing, focusing on specific regions on chromosome 2A to detect variants. The selected region harbors 98 genes. Precisely designed baits targeting the genes within these regions were employed for the target capture sequencing. We employed various mappers and variant callers to identify variants. After the identification of variants, we focused on the variants identified by all variant callers to assess the applicability of the myBaits sequencing methodology in oat breeding. In our efforts to validate the identified variants, we focused on two SNPs, one deletion, and one insertion identified by all variant callers in the genotypes KF-318 and NOS 819111-70 but absent in the remaining eight genotypes. The Sanger sequencing of targeted SNPs failed to reproduce target capture data obtained through the myBaits technology. Similarly, the validation of deletion and insertion variants via high-resolution melting (HRM) curve analysis also failed to reproduce target capture data, again suggesting limitations in the reliability of the myBaits target capture sequencing using short-read sequencing for variant detection in the oat genome. This study sheds light on the importance of exercising caution when employing the myBaits target capture strategy for variant detection in oats. This study provides valuable insights for breeders seeking to advance oat breeding efforts and marker development using myBaits target capture sequencing, emphasizing the significance of methodological sequencing considerations in oat genomics research.

  • Article type: Journal Article
    Whole-genome sequencing is widely used to investigate population genomic variation in organisms of interest. Assorted tools have been independently developed to call variants from short-read sequencing data aligned to a reference genome, including single nucleotide polymorphisms (SNPs) and structural variations (SVs). We developed SNP-SVant, an integrated, flexible, and computationally efficient bioinformatic workflow that predicts high-confidence SNPs and SVs in organisms without benchmarked variants, which are traditionally used for distinguishing sequencing errors from real variants. In the absence of these benchmarked datasets, we leverage multiple rounds of statistical recalibration to increase the precision of variant prediction. The SNP-SVant workflow is flexible, with user options to trade off accuracy for sensitivity. The workflow predicts SNPs and small insertions and deletions using the Genome Analysis ToolKit (GATK) and predicts SVs using the Genome Rearrangement IDentification Software Suite (GRIDSS), and it culminates in variant annotation using custom scripts. A key utility of SNP-SVant is its scalability. Variant calling is a computationally expensive procedure, and thus, SNP-SVant uses a workflow management system with intermediary checkpoint steps to ensure efficient use of resources by minimizing redundant computations and omitting steps where dependent files are available. SNP-SVant also provides metrics to assess the quality of called variants and converts between VCF and aligned FASTA format outputs to ensure compatibility with downstream tools to calculate selection statistics, which are commonplace in population genomics studies. By accounting for both small and large structural variants, users of this workflow can obtain a wide-ranging view of genomic alterations in an organism of interest. Overall, this workflow advances our capabilities in assessing the functional consequences of different types of genomic alterations, ultimately improving our ability to associate genotypes with phenotypes. © 2024 The Authors. Current Protocols published by Wiley Periodicals LLC. Basic Protocol: Predicting single nucleotide polymorphisms and structural variations. Support Protocol 1: Downloading publicly available sequencing data. Support Protocol 2: Visualizing variant loci using Integrated Genome Viewer. Support Protocol 3: Converting between VCF and aligned FASTA formats.

  • Article type: Journal Article
    In cancer genomics, variant calling has advanced, but traditional mean accuracy evaluations are inadequate for biomarkers like tumor mutation burden, which vary significantly across samples, affecting immunotherapy patient selection and threshold settings. In this study, we introduce TMBstable, an innovative method that dynamically selects optimal variant calling strategies for specific genomic regions using a meta-learning framework, distinguishing it from traditional callers with uniform sample-wide strategies. The process begins with segmenting the sample into windows and extracting meta-features for clustering, followed by using a pre-trained meta-model to select suitable algorithms for each cluster, thereby addressing strategy-sample mismatches, reducing performance fluctuations and ensuring consistent performance across various samples. We evaluated TMBstable using both simulated and real non-small cell lung cancer and nasopharyngeal carcinoma samples, comparing it with advanced callers. The assessment, focusing on stability measures, such as the variance and coefficient of variation in false positive rate, false negative rate, precision and recall, involved 300 simulated and 106 real tumor samples. Benchmark results showed TMBstable's superior stability with the lowest variance and coefficient of variation across performance metrics, highlighting its effectiveness in analyzing the counting-based biomarker. The TMBstable algorithm can be accessed at https://github.com/hello-json/TMBstable for academic usage only.
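    The stability measures named above (variance and coefficient of variation of a metric across samples) can be sketched as follows. This is only an illustration of the two statistics, not TMBstable's code; the function name `stability` and the per-sample precision values are made up, and the choice of population (rather than sample) variance is an assumption.

```python
from statistics import mean, pstdev, pvariance


def stability(values):
    """Return (variance, coefficient of variation) of a per-sample metric.

    Uses population statistics (an assumption of this sketch). The
    coefficient of variation is the standard deviation divided by the mean.
    """
    m = mean(values)
    variance = pvariance(values)
    cv = pstdev(values) / m if m else float("nan")
    return variance, cv


# Made-up precision values of one caller on four samples.
precision_per_sample = [0.90, 0.92, 0.88, 0.90]

variance, cv = stability(precision_per_sample)
print(f"variance={variance:.6f} cv={cv:.4f}")
```

    A caller with lower variance and lower coefficient of variation across samples is the more stable one under this definition, even if its mean accuracy is similar to that of other callers.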

  • Article type: Journal Article
    Genomic sequencing of clinical samples to identify emerging variants of SARS-CoV-2 has been a key public health tool for curbing the spread of the virus. As a result, an unprecedented number of SARS-CoV-2 genomes were sequenced during the COVID-19 pandemic, which allowed for rapid identification of genetic variants, enabling the timely design and testing of therapies and deployment of new vaccine formulations to combat the new variants. However, despite the technological advances of deep sequencing, the analysis of the raw sequence data generated globally is neither standardized nor consistent, leading to vastly disparate sequences that may impact identification of variants. Here, we show that for both Illumina and Oxford Nanopore sequencing platforms, downstream bioinformatic protocols used by industry, government, and academic groups resulted in different virus sequences from the same sample. These bioinformatic workflows produced consensus genomes with differences in single nucleotide polymorphisms, inclusion and exclusion of insertions, and/or deletions, despite using the same raw sequence as input datasets. We compare and characterize such discrepancies and propose a specific suite of parameters and protocols that should be adopted across the field. Consistent results from bioinformatic workflows are fundamental to SARS-CoV-2 and future pathogen surveillance efforts, including pandemic preparation, to allow for a data-driven and timely public health response.

  • Article type: Journal Article
    Structural variants (SVs) are a crucial type of genetic variant that can significantly impact phenotypes. Therefore, the identification of SVs is an essential part of modern genomic analysis. In this article, we present kled, an ultra-fast and sensitive SV caller for long-read sequencing data, built on a specially designed approach with a novel signature-merging algorithm, custom refinement strategies and a high-performance program structure. The evaluation results demonstrate that kled can achieve optimal SV calling compared to several state-of-the-art methods on simulated and real long-read data for different platforms and sequencing depths. Furthermore, kled excels at rapid SV calling and can efficiently utilize multiple central processing unit (CPU) cores while maintaining low memory usage. The source code for kled can be obtained from https://github.com/CoREse/kled.

  • Article type: Journal Article
    Genome sequencing data have become increasingly important in the field of personalized medicine and diagnosis. However, accurately detecting genomic variations remains a challenging task. Traditional variation detection methods rely on manual inspection or predefined rules, which can be time-consuming and prone to errors. Consequently, deep learning-based approaches for variation detection have gained attention due to their ability to automatically learn genomic features that distinguish between variants. In our review, we discuss the recent advancements in deep learning-based algorithms for detecting small variations and structural variations in genomic data, as well as their advantages and limitations.

  • Article type: Journal Article
    Mounting evidence recognizes structural variations (SVs) and repetitive DNA sequences as crucial players in shaping the existing grape phenotypic diversity at intra- and inter-species levels. To deepen our understanding of the abundance, diversity, and distribution of SVs and repetitive DNAs, including transposable elements (TEs) and tandemly repeated satellite DNAs (satDNAs), we re-sequenced the genomes of the ancient grapes Aglianico and Falanghina. The analysis of large copy number variants (CNVs) detected candidate polymorphic genes that are involved in the enological features of these varieties. In a comparative analysis of Aglianico and Falanghina sequences with 21 publicly available genomes of cultivated grapes, we provided a genome-wide annotation of grape TEs at the lineage level. We found that at least two main clusters of grape cultivars could be identified based on TE content. Multiple TE families appeared either significantly enriched or depleted. In addition, in silico and cytological analyses provided evidence for a diverse chromosomal distribution of several satellite repeats between Aglianico, Falanghina, and other grapes. Overall, our data further improved our understanding of the intricate grape diversity held by two Italian traditional varieties, unveiling a pool of unique candidate genes never so far exploited in breeding for improved fruit quality.

  • Article type: Review
    Next-generation sequencing (NGS) has revolutionized the field of rare disease diagnostics. Whole exome and whole genome sequencing are now routinely used for diagnostic purposes; however, the overall diagnosis rate remains lower than expected. In this work, we review current approaches used for calling and interpretation of germline genetic variants in the human genome, and discuss the most important challenges that persist in the bioinformatic analysis of NGS data in medical genetics. We describe and attempt to quantitatively assess the remaining problems, such as the quality of the reference genome sequence, reproducible coverage biases, or variant calling accuracy in complex regions of the genome. We also discuss the prospects of switching to the complete human genome assembly or the human pan-genome and important caveats associated with such a switch. We touch on arguably the hardest problem of NGS data analysis for medical genomics, namely, the annotation of genetic variants and their subsequent interpretation. We highlight the most challenging aspects of annotation and prioritization of both coding and non-coding variants. Finally, we demonstrate the persistent prevalence of pathogenic variants in the coding genome, and outline research directions that may enhance the efficiency of NGS-based disease diagnostics.

  • Article type: Journal Article
    A macrohaplotype combines multiple types of phased DNA variants, increasing forensic discrimination power. High-quality long sequencing reads, for example, PacBio HiFi reads, provide data to detect macrohaplotypes in multiploidy and DNA mixtures. However, bioinformatics tools for detecting macrohaplotypes are lacking. In this study, we developed a bioinformatics tool, MacroHapCaller, in which targeted loci (i.e., short tandem repeats [STRs], single nucleotide polymorphisms, and insertions and deletions) are genotyped and combined with novel algorithms to call macrohaplotypes from long reads. MacroHapCaller uses physical phasing (i.e., read-backed phasing) to identify macrohaplotypes, and thus it can detect multi-allelic macrohaplotypes for a given sample. MacroHapCaller was validated with data generated from our designed targeted PacBio HiFi sequencing pipeline, which sequenced ∼8-kb amplicon regions harboring 20 core forensic STR loci in human benchmark samples HG002 and HG003. MacroHapCaller was also validated on whole-genome long-read sequencing data. Robust and accurate genotyping and phased macrohaplotypes were obtained with MacroHapCaller compared with the known ground truth. MacroHapCaller achieved a higher or consistent genotyping accuracy and faster speed than existing tools HipSTR and DeepVar. MacroHapCaller enables efficient macrohaplotype analysis from high-throughput sequencing data and supports applications using discriminating macrohaplotypes.