Variant calling

  • Article type: Journal Article
    Introduction: Structural variants (SVs) are a type of variation that can significantly influence phenotypes and cause disease. Accurate detection of SVs is therefore a vital part of modern genetic analysis. The advent of long-read sequencing technology has ushered in a new era of more accurate and comprehensive SV calling, and many tools have been developed to call SVs from long-read data. Haplotype tagging is a procedure that attaches haplotype information to reads and can thus potentially improve SV detection; nevertheless, few methods make use of this information. In this article, we introduce HapKled, a new SV detection tool that can accurately detect SVs from Oxford Nanopore Technologies (ONT) long-read alignment data. Methods: HapKled exploits the haplotype information underlying alignment data by haplotype-tagging the reads with WhatsHap to improve detection performance. It has three distinctive calling mechanisms: altering clustering conditions according to the haplotype information of signatures, determining similar SVs based on haplotype information, and relaxing filtering conditions based on haplotype quality. Results: In our evaluations, HapKled outperformed state-of-the-art tools and delivered better SV detection results on both simulated and real sequencing data. The code and experiments of HapKled can be obtained from https://github.com/CoREse/HapKled. Discussion: Given its SV detection performance, HapKled could be useful in bioinformatics research, clinical diagnosis, and medical research and development.
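To make the idea of haplotype-dependent clustering conditions concrete, here is a minimal Python sketch. It is not HapKled's algorithm: the `Signature` type, the `max_gap` and `cluster` functions, and all distance thresholds are hypothetical and chosen only for illustration. The sketch assumes each SV signature carries the haplotype tag of its read (0 for untagged) and merges neighbouring signatures more readily when their tags agree and less readily when they disagree.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Signature:
    pos: int     # reference start of the SV signature
    length: int  # SV length implied by the read (carried along, unused here)
    hap: int     # 0 = untagged read, 1 or 2 = haplotype tag of the read


def max_gap(a, b, base=100, relaxed=300, strict=30):
    """Pick the allowed distance between two signatures (hypothetical values)."""
    if a.hap and b.hap:
        # Both reads are tagged: relax if they agree, tighten if they differ.
        return relaxed if a.hap == b.hap else strict
    # At least one read is untagged: fall back to the default threshold.
    return base


def cluster(signatures):
    """Greedy single-pass clustering over position-sorted signatures."""
    clusters = []
    for sig in sorted(signatures, key=lambda s: s.pos):
        if clusters and sig.pos - clusters[-1][-1].pos <= max_gap(clusters[-1][-1], sig):
            clusters[-1].append(sig)
        else:
            clusters.append([sig])
    return clusters


signatures = [
    Signature(1000, 50, 1),
    Signature(1200, 52, 1),  # same haplotype, gap 200 <= 300: merged
    Signature(1250, 50, 2),  # different haplotype, gap 50 > 30: new cluster
    Signature(1260, 51, 0),  # untagged, gap 10 <= 100: merged
]
clusters = cluster(signatures)
cluster_sizes = [len(c) for c in clusters]
```

With these made-up thresholds, the two haplotype-1 signatures form one cluster even though they are 200 bp apart, while the haplotype-2 signature starts a new cluster despite being only 50 bp further along.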

  • Article type: Journal Article
    BACKGROUND: At a global scale, the SARS-CoV-2 virus did not remain in its initial genotype for a long period of time, with the first global reports of variants of concern (VOCs) in late 2020. Subsequently, genome sequencing has become an indispensable tool for characterizing the ongoing pandemic, particularly for typing SARS-CoV-2 samples obtained from patients or environmental surveillance. For such SARS-CoV-2 typing, various in vitro and in silico workflows exist, yet to date, no systematic cross-platform validation has been reported.
    RESULTS: In this work, we present the first comprehensive cross-platform evaluation and validation of in silico SARS-CoV-2 typing workflows. The evaluation relies on a dataset of 54 patient-derived samples sequenced with several different in vitro approaches on all relevant state-of-the-art sequencing platforms. Moreover, we present UnCoVar, a robust, production-grade reproducible SARS-CoV-2 typing workflow that outperforms all other tested approaches in terms of precision and recall.
    CONCLUSIONS: In many ways, the SARS-CoV-2 pandemic has accelerated the development of techniques and analytical approaches. We believe that this can serve as a blueprint for dealing with future pandemics. Accordingly, UnCoVar is easily generalizable towards other viral pathogens and future pandemics. The fully automated workflow assembles virus genomes from patient samples, identifies existing lineages, and provides high-resolution insights into individual mutations. UnCoVar includes extensive quality control and automatically generates interactive visual reports. UnCoVar is implemented as a Snakemake workflow. The open-source code is available under a BSD 2-clause license at github.com/IKIM-Essen/uncovar.

  • Article type: Journal Article
    The integration of target capture systems with next-generation sequencing has emerged as an efficient tool for exploring specific genetic regions at high resolution and facilitating the rapid discovery of novel alleles. Despite these advancements, the application of targeted sequencing methodologies, such as the myBaits technology, in polyploid oat species remains relatively unexplored. In this study, we utilized the myBaits target capture method offered by Daicel Arbor Biosciences to detect variants and assess its reliability for variant detection in oat genomics and breeding. Ten oat genotypes were carefully chosen for targeted sequencing, focusing on specific regions on chromosome 2A to detect variants. The selected region harbors 98 genes. Precisely designed baits targeting the genes within these regions were employed for the target capture sequencing. We employed various mappers and variant callers to identify variants. After the identification of variants, we focused on the variants identified by all variant callers to assess the applicability of the myBaits sequencing methodology in oat breeding. In our efforts to validate the identified variants, we focused on two SNPs, one deletion, and one insertion identified by all variant callers in the genotypes KF-318 and NOS 819111-70 but absent in the remaining eight genotypes. Sanger sequencing of the targeted SNPs failed to reproduce the target capture data obtained through the myBaits technology. Similarly, validation of the deletion and insertion variants via high-resolution melting (HRM) curve analysis also failed to reproduce the target capture data, again suggesting limitations in the reliability of myBaits target capture sequencing with short reads for variant detection in the oat genome. This study sheds light on the importance of exercising caution when employing the myBaits target capture strategy for variant detection in oats. It provides valuable insights for breeders seeking to advance oat breeding efforts and marker development using myBaits target capture sequencing, emphasizing the significance of methodological sequencing considerations in oat genomics research.

  • Article type: Journal Article
    Whole-genome sequencing is widely used to investigate population genomic variation in organisms of interest. Assorted tools have been independently developed to call variants from short-read sequencing data aligned to a reference genome, including single nucleotide polymorphisms (SNPs) and structural variations (SVs). We developed SNP-SVant, an integrated, flexible, and computationally efficient bioinformatic workflow that predicts high-confidence SNPs and SVs in organisms without benchmarked variants, which are traditionally used for distinguishing sequencing errors from real variants. In the absence of these benchmarked datasets, we leverage multiple rounds of statistical recalibration to increase the precision of variant prediction. The SNP-SVant workflow is flexible, with user options to trade off accuracy for sensitivity. The workflow predicts SNPs and small insertions and deletions using the Genome Analysis ToolKit (GATK) and predicts SVs using the Genome Rearrangement IDentification Software Suite (GRIDSS), and it culminates in variant annotation using custom scripts. A key utility of SNP-SVant is its scalability. Variant calling is a computationally expensive procedure, and thus, SNP-SVant uses a workflow management system with intermediary checkpoint steps to ensure efficient use of resources by minimizing redundant computations and omitting steps where dependent files are available. SNP-SVant also provides metrics to assess the quality of called variants and converts between VCF and aligned FASTA format outputs to ensure compatibility with downstream tools to calculate selection statistics, which are commonplace in population genomics studies. By accounting for both small and large structural variants, users of this workflow can obtain a wide-ranging view of genomic alterations in an organism of interest. Overall, this workflow advances our capabilities in assessing the functional consequences of different types of genomic alterations, ultimately improving our ability to associate genotypes with phenotypes. © 2024 The Authors. Current Protocols published by Wiley Periodicals LLC. Basic Protocol: Predicting single nucleotide polymorphisms and structural variations. Support Protocol 1: Downloading publicly available sequencing data. Support Protocol 2: Visualizing variant loci using Integrated Genome Viewer. Support Protocol 3: Converting between VCF and aligned FASTA formats.
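The checkpointing behaviour described in the abstract (omitting steps whose dependent files are already available) can be illustrated with a small Python sketch. This is not SNP-SVant's code and does not use its workflow management system; `run_step`, the step name, and the file name are hypothetical stand-ins for the general pattern.

```python
import tempfile
from pathlib import Path


def run_step(name, outputs, action):
    """Run `action` unless every file in `outputs` already exists.

    Returns True if the step ran and False if it was skipped.
    """
    outputs = [Path(p) for p in outputs]
    if outputs and all(p.exists() for p in outputs):
        return False  # checkpoint hit: results are already on disk
    action()
    missing = [str(p) for p in outputs if not p.exists()]
    if missing:
        raise RuntimeError(f"step {name!r} did not produce: {missing}")
    return True


with tempfile.TemporaryDirectory() as workdir:
    vcf = Path(workdir) / "calls.vcf"  # hypothetical output file

    def call_variants():
        # Placeholder for an expensive variant-calling command.
        vcf.write_text("##fileformat=VCFv4.2\n")

    first_run = run_step("call_variants", [vcf], call_variants)   # runs
    second_run = run_step("call_variants", [vcf], call_variants)  # skipped
```

Real workflow managers typically also compare timestamps or checksums of inputs and outputs; this sketch checks only for file existence.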

  • Article type: Journal Article
    In cancer genomics, variant calling has advanced, but traditional mean accuracy evaluations are inadequate for biomarkers like tumor mutation burden, which vary significantly across samples, affecting immunotherapy patient selection and threshold settings. In this study, we introduce TMBstable, an innovative method that dynamically selects optimal variant calling strategies for specific genomic regions using a meta-learning framework, distinguishing it from traditional callers with uniform sample-wide strategies. The process begins with segmenting the sample into windows and extracting meta-features for clustering, followed by using a pre-trained meta-model to select suitable algorithms for each cluster, thereby addressing strategy-sample mismatches, reducing performance fluctuations, and ensuring consistent performance across various samples. We evaluated TMBstable using both simulated and real non-small cell lung cancer and nasopharyngeal carcinoma samples, comparing it with advanced callers. The assessment, which focused on stability measures such as the variance and coefficient of variation of the false positive rate, false negative rate, precision, and recall, involved 300 simulated and 106 real tumor samples. Benchmark results showed TMBstable's superior stability, with the lowest variance and coefficient of variation across performance metrics, highlighting its effectiveness in analyzing this counting-based biomarker. The TMBstable algorithm can be accessed at https://github.com/hello-json/TMBstable for academic use only.
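The stability measures named in the abstract, the variance and coefficient of variation of a per-sample metric, can be computed as in the following Python sketch. The function names and the recall values are made up for illustration and are not taken from the TMBstable benchmark; the population variance is an assumed choice, since the abstract does not say which estimator was used.

```python
from statistics import mean, pvariance


def coefficient_of_variation(values):
    """Population standard deviation divided by the mean."""
    m = mean(values)
    if m == 0:
        raise ValueError("coefficient of variation is undefined for a zero mean")
    return pvariance(values) ** 0.5 / m


def stability_summary(per_sample_metric):
    """Summarize how much a metric fluctuates across samples."""
    return {
        "mean": mean(per_sample_metric),
        "variance": pvariance(per_sample_metric),
        "cv": coefficient_of_variation(per_sample_metric),
    }


# Hypothetical per-sample recall for two callers with the same mean recall.
recall_a = [0.90, 0.91, 0.89, 0.90]
recall_b = [0.99, 0.80, 0.95, 0.86]

summary_a = stability_summary(recall_a)
summary_b = stability_summary(recall_b)
```

Both hypothetical callers average a recall of 0.90, so a mean-only comparison cannot separate them, whereas the variance and coefficient of variation show that the second caller fluctuates far more from sample to sample.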

  • Article type: Journal Article
    Genomic sequencing of clinical samples to identify emerging variants of SARS-CoV-2 has been a key public health tool for curbing the spread of the virus. As a result, an unprecedented number of SARS-CoV-2 genomes were sequenced during the COVID-19 pandemic, which allowed for rapid identification of genetic variants, enabling the timely design and testing of therapies and deployment of new vaccine formulations to combat the new variants. However, despite the technological advances of deep sequencing, the analysis of the raw sequence data generated globally is neither standardized nor consistent, leading to vastly disparate sequences that may impact identification of variants. Here, we show that for both Illumina and Oxford Nanopore sequencing platforms, downstream bioinformatic protocols used by industry, government, and academic groups resulted in different virus sequences from the same sample. These bioinformatic workflows produced consensus genomes that differed in single nucleotide polymorphisms and in the inclusion or exclusion of insertions and/or deletions, despite using the same raw sequences as input datasets. We compare and characterize such discrepancies and propose a specific suite of parameters and protocols that should be adopted across the field. Consistent results from bioinformatic workflows are fundamental to SARS-CoV-2 and future pathogen surveillance efforts, including pandemic preparation, to allow for a data-driven and timely public health response.

  • Article type: Journal Article
    Structural variants (SVs) are a crucial type of genetic variant that can significantly impact phenotypes. Therefore, the identification of SVs is an essential part of modern genomic analysis. In this article, we present kled, an ultra-fast and sensitive SV caller for long-read sequencing data, built on a specially designed approach with a novel signature-merging algorithm, custom refinement strategies, and a high-performance program structure. The evaluation results demonstrate that kled can achieve optimal SV calling compared to several state-of-the-art methods on simulated and real long-read data for different platforms and sequencing depths. Furthermore, kled excels at rapid SV calling and can efficiently utilize multiple Central Processing Unit (CPU) cores while maintaining low memory usage. The source code for kled can be obtained from https://github.com/CoREse/kled.

  • Article type: Review
    Next-generation sequencing (NGS) has revolutionized the field of rare disease diagnostics. Whole exome and whole genome sequencing are now routinely used for diagnostic purposes; however, the overall diagnosis rate remains lower than expected. In this work, we review current approaches used for calling and interpretation of germline genetic variants in the human genome, and discuss the most important challenges that persist in the bioinformatic analysis of NGS data in medical genetics. We describe and attempt to quantitatively assess the remaining problems, such as the quality of the reference genome sequence, reproducible coverage biases, or variant calling accuracy in complex regions of the genome. We also discuss the prospects of switching to the complete human genome assembly or the human pan-genome and important caveats associated with such a switch. We touch on arguably the hardest problem of NGS data analysis for medical genomics, namely, the annotation of genetic variants and their subsequent interpretation. We highlight the most challenging aspects of annotation and prioritization of both coding and non-coding variants. Finally, we demonstrate the persistent prevalence of pathogenic variants in the coding genome, and outline research directions that may enhance the efficiency of NGS-based disease diagnostics.

  • Article type: Journal Article
    Exome sequencing (ES) is a recommended first-tier diagnostic test for many rare monogenic diseases. It allows for the detection of both single-nucleotide variants (SNVs) and copy number variants (CNVs) in coding exonic regions of the genome in a single test, and this dual analysis is a valuable approach, especially in resource-limited settings. Single-nucleotide variants are well studied; however, the incorporation of copy number variant analysis tools into variant calling pipelines has not yet been implemented as a routine diagnostic test, and chromosomal microarray is still more widely used to detect copy number variants. Research shows that combined single-nucleotide and copy number variant analysis can lead to a diagnostic yield of up to 58%, increasing the yield by as much as 18% over the single-nucleotide-variant-only pipeline. Importantly, this is achieved at computational cost only, without incurring any additional sequencing costs. This mini review provides an overview of copy number variant analysis from exome data and of the current recommendations for this type of analysis. We also present an overview of standard practices in rare monogenic disease research in resource-limited settings. We present evidence that integrating copy number variant detection tools into a standard exome sequencing analysis pipeline improves diagnostic yield and should be considered a significantly beneficial addition, with relatively low cost implications. Routine implementation in underrepresented populations and resource-limited settings will promote the generation and sharing of CNV datasets and provide momentum to build core centers for this niche within genomic medicine.

  • Article type: Journal Article
    BACKGROUND: The accurate detection of variants is essential for genomics-based studies. Currently, there are various tools designed to detect genomic variants; however, it has always been a challenge to decide which tool to use, especially when various major genome projects have chosen to use different tools. Thus far, most of the existing tools were mainly developed to work on short-read data (i.e., Illumina); however, other sequencing technologies (e.g., PacBio and Oxford Nanopore) have recently shown that they can also be used for variant calling. In addition, with the emergence of artificial intelligence (AI)-based variant calling tools, there is a pressing need to compare these tools in terms of efficiency, accuracy, computational power, and ease of use.
    RESULTS: In this study, we evaluated five of the most widely used conventional and AI-based variant calling tools (BCFTools, GATK4, Platypus, DNAscope, and DeepVariant) in terms of accuracy and computational cost using both short-read and long-read data derived from three different sequencing technologies (Illumina, PacBio HiFi, and ONT) for the same set of samples from the Genome in a Bottle project. The analysis showed that AI-based variant calling tools outperform conventional ones in most aspects for calling SNVs and INDELs from both long and short reads. In addition, we demonstrate the advantages and drawbacks of each tool while ranking them in each aspect of these comparisons.
    CONCLUSIONS: This study provides best practices for variant calling using AI-based and conventional variant callers with different types of sequencing data.
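As a rough illustration of how a caller's accuracy can be scored against a truth set, the following Python sketch computes precision, recall, and F1 from exact matches on (chromosome, position, reference allele, alternate allele). It is a simplification and not the evaluation procedure of the study: the `benchmark` function and the variant records are hypothetical, and dedicated benchmarking tools additionally normalize variant representation and restrict comparisons to confident regions.

```python
def benchmark(calls, truth):
    """Score a call set against a truth set by exact variant matching."""
    calls, truth = set(calls), set(truth)
    tp = len(calls & truth)   # called and present in the truth set
    fp = len(calls - truth)   # called but absent from the truth set
    fn = len(truth - calls)   # in the truth set but missed by the caller
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"tp": tp, "fp": fp, "fn": fn,
            "precision": precision, "recall": recall, "f1": f1}


# Hypothetical variants: (chromosome, 1-based position, ref allele, alt allele).
truth_set = [
    ("chr1", 1000, "A", "G"),
    ("chr1", 2000, "C", "T"),
    ("chr2", 1500, "G", "GA"),
    ("chr2", 3000, "TC", "T"),
]
call_set = [
    ("chr1", 1000, "A", "G"),
    ("chr1", 2000, "C", "T"),
    ("chr2", 1500, "G", "GA"),
    ("chr3", 500, "T", "C"),   # false positive
]

scores = benchmark(call_set, truth_set)
```

In this toy case the caller recovers three of the four true variants and reports one spurious call, so precision and recall are both 0.75.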