Variant calling

  • Article type: Journal Article
    Introduction: Structural variants (SVs) are a type of variation that can significantly influence phenotypes and cause disease. Accurate SV detection is therefore a vital part of modern genetic analysis. The advent of long-read sequencing has ushered in an era of more accurate and comprehensive SV calling, and many tools have been developed to call SVs from long-read data. Haplotype tagging is a procedure that labels reads with haplotype information and can thus potentially improve SV detection; nevertheless, few methods make use of this information. In this article, we introduce HapKled, a new SV detection tool that can accurately detect SVs from Oxford Nanopore Technologies (ONT) long-read alignment data. Methods: HapKled exploits the haplotype information underlying the alignment data by haplotype-tagging the reads with WhatsHap, and improves detection performance through three calling mechanisms: altering clustering conditions according to the haplotype information of signatures, determining similar SVs based on haplotype information, and relaxing filtering conditions based on haplotype quality. Results: In our evaluations, HapKled outperformed state-of-the-art tools and delivered better SV detection results on both simulated and real sequencing data. The code and experiments of HapKled can be obtained from https://github.com/CoREse/HapKled. Discussion: Given its SV detection performance, HapKled could be useful in bioinformatics research, clinical diagnosis, and medical research and development.
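The first calling mechanism named in the abstract, altering clustering conditions according to the haplotype information of signatures, can be illustrated with a minimal sketch. This is not HapKled's actual algorithm: the `Signature` class, the `max_gap` rule, and every threshold value below are hypothetical, chosen only to show how a haplotype tag (1 or 2, with 0 standing for an untagged read) might loosen or tighten a position-based clustering condition.

```python
from dataclasses import dataclass

@dataclass
class Signature:
    pos: int     # reference start position of the SV signature
    length: int  # signature length in bases
    hp: int      # haplotype tag: 1 or 2, or 0 when the read is untagged

def max_gap(a, b, base=500, same_hp_factor=2.0, diff_hp_factor=0.5):
    """Hypothetical rule: allow a larger gap when both signatures carry the
    same haplotype tag, a smaller one when the tags differ, and the base gap
    when either signature is untagged."""
    if a.hp and b.hp:
        return base * (same_hp_factor if a.hp == b.hp else diff_hp_factor)
    return base

def cluster(signatures):
    """Greedy single-pass clustering of position-sorted signatures."""
    clusters = []
    for sig in sorted(signatures, key=lambda s: s.pos):
        if clusters and sig.pos - clusters[-1][-1].pos <= max_gap(clusters[-1][-1], sig):
            clusters[-1].append(sig)
        else:
            clusters.append([sig])
    return clusters

tagged = [Signature(1000, 300, 1), Signature(1800, 310, 1), Signature(2200, 305, 2)]
untagged = [Signature(s.pos, s.length, 0) for s in tagged]

tagged_sizes = [len(c) for c in cluster(tagged)]
untagged_sizes = [len(c) for c in cluster(untagged)]
```

With these illustrative numbers, the two signatures on haplotype 1 merge despite an 800 bp gap, while the haplotype 2 signature is kept apart; without tags the grouping changes, which is the kind of effect a haplotype-aware condition is meant to produce.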

  • Article type: Journal Article
    In cancer genomics, variant calling has advanced, but traditional mean-accuracy evaluations are inadequate for biomarkers such as tumor mutation burden, which vary significantly across samples and affect immunotherapy patient selection and threshold settings. In this study, we introduce TMBstable, a method that uses a meta-learning framework to dynamically select the optimal variant calling strategy for specific genomic regions, which distinguishes it from traditional callers that apply one uniform strategy to the whole sample. The process begins by segmenting the sample into windows and extracting meta-features for clustering; a pre-trained meta-model then selects a suitable algorithm for each cluster, thereby addressing strategy-sample mismatches, reducing performance fluctuations, and ensuring consistent performance across samples. We evaluated TMBstable on both simulated and real non-small cell lung cancer and nasopharyngeal carcinoma samples, comparing it with advanced callers. The assessment, which involved 300 simulated and 106 real tumor samples, focused on stability measures such as the variance and coefficient of variation of the false positive rate, false negative rate, precision, and recall. Benchmark results showed TMBstable's superior stability, with the lowest variance and coefficient of variation across performance metrics, highlighting its effectiveness in analyzing counting-based biomarkers. The TMBstable algorithm can be accessed at https://github.com/hello-json/TMBstable for academic use only.
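The stability measures mentioned in the abstract, variance and coefficient of variation of a metric across samples, can be computed as in the following minimal sketch. The per-sample recall values are made-up illustrative numbers, not results from the study, and the use of the population variance is an assumption, since the abstract does not say which estimator was used.

```python
from statistics import mean, pvariance

def stability(values):
    """Return (variance, coefficient of variation) of a metric across samples.

    Uses the population variance; the CV is the standard deviation divided
    by the mean, and is NaN when the mean is zero.
    """
    mu = mean(values)
    var = pvariance(values, mu)
    cv = (var ** 0.5) / mu if mu else float("nan")
    return var, cv

# Hypothetical per-sample recall for two callers (illustrative only).
recall_a = [0.90, 0.92, 0.91, 0.89]  # similar mean, small spread
recall_b = [0.99, 0.80, 0.95, 0.86]  # similar mean, large spread

var_a, cv_a = stability(recall_a)
var_b, cv_b = stability(recall_b)
```

Both hypothetical callers have nearly the same mean recall, yet the second fluctuates far more from sample to sample; a mean-accuracy comparison would hide that difference, whereas the variance and coefficient of variation expose it.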

  • Article type: Journal Article
    Structural variants (SVs) are a crucial type of genetic variant that can significantly impact phenotypes. The identification of SVs is therefore an essential part of modern genomic analysis. In this article, we present kled, an ultra-fast and sensitive SV caller for long-read sequencing data, built on a specially designed approach with a novel signature-merging algorithm, custom refinement strategies, and a high-performance program structure. The evaluation results demonstrate that kled achieves the best SV calling among several state-of-the-art methods on simulated and real long-read data across different platforms and sequencing depths. Furthermore, kled excels at rapid SV calling and can efficiently use multiple central processing unit (CPU) cores while maintaining low memory usage. The source code for kled can be obtained from https://github.com/CoREse/kled.

  • Article type: Journal Article
    Genome sequencing data have become increasingly important in personalized medicine and diagnosis. However, accurately detecting genomic variations remains a challenging task. Traditional variation detection methods rely on manual inspection or predefined rules, which can be time-consuming and error-prone. Consequently, deep learning-based approaches to variation detection have gained attention for their ability to automatically learn the genomic features that distinguish variants. In this review, we discuss recent advances in deep learning-based algorithms for detecting small variations and structural variations in genomic data, along with their advantages and limitations.

  • Article type: Journal Article
    The precisionFDA Truth Challenge V2 aimed to assess the state of the art of variant calling in challenging genomic regions. Starting with FASTQs, 20 challenge participants applied their variant-calling pipelines and submitted 64 variant call sets for one or more sequencing technologies (Illumina, PacBio HiFi, and Oxford Nanopore Technologies). Submissions were evaluated following best practices for benchmarking small variants with updated Genome in a Bottle benchmark sets and genome stratifications. Challenge submissions included numerous innovative methods, with graph-based and machine learning methods scoring best for short-read and long-read datasets, respectively. With machine learning approaches, combining multiple sequencing technologies performed particularly well. Recent developments in sequencing and variant calling have enabled benchmarking variants in challenging genomic regions, paving the way for the identification of previously unknown clinically relevant variants.

  • Article type: Journal Article
    This study discusses genetic mutations that are significantly associated with economically important traits and would benefit tea breeders. The purpose of this study was to analyze leaf quality and SNPs in quality-related genes in a tea plant collection of 20 mutant genotypes grown without nitrogen fertilizers. Leaf N content and the catechin, L-theanine, and caffeine contents of dry leaves were analyzed via HPLC. Additionally, the photochemical yield, electron transport efficiency, and non-photochemical quenching were analyzed using PAM fluorimetry. A next-generation pooled amplicon-sequencing approach was used for SNP calling in 30 key genes related to N metabolism and leaf quality. The leaf N content varied significantly among genotypes (p ≤ 0.05), from 2.3 to 3.7% of dry mass. The caffeine content varied from 0.7 to 11.7 mg g⁻¹, and the L-theanine content varied from 0.2 to 5.8 mg g⁻¹ dry leaf mass. Significant positive correlations were detected between the nitrogen content and biochemical parameters such as theanine, caffeine, and most of the catechins. However, significant negative correlations were observed between the photosynthetic parameters (Y, ETR, Fv/Fm) and several biochemical compounds, including rutin, quercetin-3-O-glucoside, kaempferol-3-O-rutinoside, kaempferol-3-O-glucoside, theaflavin-3'-gallate, and gallic acid. In our SNP analysis, three SNPs in WRKY57 were detected in all genotypes with a low N content. Moreover, 29 SNPs with a high or moderate effect were specific to #316 (high N content, high quality) or #507 (low N content, low quality). A linear regression model revealed 16 significant associations; theaflavin, L-theanine, and ECG were associated with several SNPs in the following genes: ANSa, DFRa, GDH2, 4CL, AlaAT1, MYB4, LHT1, F3'5'Hb, and UFGTa. Among them, seven SNPs of moderate effect led to amino acid changes in the final proteins of the following genes: ANSa, GDH2, 4CL, F3'5'Hb, and UFGTa. These results will be useful for further evaluation of the important SNPs and will help to provide a better understanding of the mechanisms of nitrogen uptake efficiency in tree crops.

  • Article type: Journal Article
    BACKGROUND: With continuous advances in third-generation sequencing technology and the increasing affordability of next-generation sequencing technology, sequencing data from different technology platforms are becoming more common. While numerous benchmarking studies have compared variant-calling performance across platforms and approaches, little attention has been paid to the potential of leveraging the strengths of different platforms to optimize overall performance, especially by integrating Oxford Nanopore and Illumina sequencing data.
    RESULTS: We investigated the impact of multi-platform data on variant-calling performance through carefully designed experiments with a deep learning-based variant caller named Clair3-MP (Multi-Platform). We not only demonstrated that ONT-Illumina data can improve variant calling, but also identified the optimal scenarios for using ONT-Illumina data. In addition, we found that the improvement in variant calling with ONT-Illumina data comes from difficult genomic regions, such as large low-complexity regions and segmental and collapse duplication regions. Moreover, Clair3-MP can incorporate reference genome stratification information to achieve a small but measurable improvement in variant calling. Clair3-MP is accessible as an open-source project at https://github.com/HKU-BAL/Clair3-MP.
    CONCLUSIONS: These insights have important implications for researchers and practitioners alike, providing valuable guidance for improving the reliability and efficiency of genomic analysis in diverse applications.

  • Article type: Journal Article
    Delins, also known as complex indel, is a combined genomic structural variation formed by deleting and inserting DNA fragments at a common genomic location. Recent studies have emphasized the importance of delins in cancer diagnosis and treatment. Although long reads from PacBio CLR sequencing significantly facilitate delins calling, existing approaches still face computational challenges from the high level of sequencing errors and often introduce errors when genotyping and phasing delins. In this paper, we propose an efficient algorithmic pipeline, named delInsCaller, to identify delins at haplotype resolution from PacBio CLR sequencing data. delInsCaller implements a fault-tolerant method that calculates a variation density score, which helps to locate candidate mutational regions under a high level of sequencing errors. It adopts a base association-based contig splicing method, which facilitates contig splicing in the presence of false-positive interference. We conducted a series of experiments on simulated datasets, and the results showed that delInsCaller outperformed several state-of-the-art approaches, e.g., SVseq3, across a wide range of parameter settings such as read depth and sequencing error rate. delInsCaller often obtained higher F-measures than the other approaches; specifically, it maintained its advantage at ~15% sequencing errors. Compared with the existing approach, delInsCaller also significantly improved the N50 values with almost no loss of haplotype accuracy.
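The two evaluation quantities named in the abstract, the F-measure and N50, have standard definitions that the following minimal sketch computes. The function names and the example numbers are illustrative assumptions; the abstract does not state which lengths the N50 was computed over, so the input here is simply a list of lengths.

```python
def f_measure(tp, fp, fn):
    """Harmonic mean of precision and recall from true-positive,
    false-positive, and false-negative counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

def n50(lengths):
    """Largest length L such that pieces of length >= L together cover
    at least half of the total length. Returns 0 for an empty list."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    return 0

example_f = f_measure(tp=8, fp=2, fn=2)
example_n50 = n50([2, 3, 4, 5, 6])
```

In the N50 example the total length is 20; the two longest pieces (6 and 5) already cover 11, which is more than half, so the N50 is 5.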

  • Article type: Journal Article
    Accurate identification of genetic variants from family child-mother-father trio sequencing data is important in genomics. However, state-of-the-art approaches treat variant calling from trios as three independent tasks, which limits their calling accuracy for Nanopore long-read sequencing data. For better trio variant calling, we introduce Clair3-Trio, the first variant caller tailored for family trio data from Nanopore long reads. Clair3-Trio employs a Trio-to-Trio deep neural network model, which takes the trio sequencing information as input and outputs all of the trio's predicted variants within a single model to improve variant calling. We also present MCVLoss, a novel loss function tailor-made for variant calling in trios that leverages an explicit encoding of Mendelian inheritance. Clair3-Trio showed comprehensive improvement in experiments. It predicted far fewer Mendelian inheritance violations than current state-of-the-art methods. We also demonstrated that our Trio-to-Trio model is more accurate than competing architectures. Clair3-Trio is accessible as a free, open-source project at https://github.com/HKU-BAL/Clair3-Trio.
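What counts as a Mendelian inheritance violation can be made concrete with a minimal sketch for a single diploid autosomal site. This is only the textbook consistency check, not MCVLoss or any part of Clair3-Trio; the function name and the genotype encoding (a pair of allele indices, 0 for reference and 1 for alternate) are assumptions made for illustration, and de novo mutations and sex chromosomes are ignored.

```python
from itertools import product

def is_mendelian_consistent(child, mother, father):
    """True if the child's genotype can be formed by taking one allele
    from the mother and one from the father (phase is ignored)."""
    target = sorted(child)
    return any(sorted((m, f)) == target for m, f in product(mother, father))

# Child is heterozygous, mother homozygous reference, father heterozygous.
consistent = is_mendelian_consistent((0, 1), (0, 0), (0, 1))

# Child is homozygous alternate, but the mother carries no alternate allele.
violation = not is_mendelian_consistent((1, 1), (0, 0), (1, 1))
```

Counting the sites where this check fails across a call set gives the kind of violation tally that can be compared between callers.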

  • Article type: Journal Article
    Landraces preserved by indigenous peoples worldwide exhibit large variation in phenotype and in adaptation to different environments, which suggests that they are a rich resource and can serve as a gene pool for rice improvement. Despite extensive studies on cultivated rice, the variation and relationships between landraces and modern cultivated rice remain unclear. In this study, a total of 20 varieties, comprising 10 Oryza javanica varieties collected from different countries worldwide and 10 Oryza indica varieties from China, were genotyped, yielding 99.9 Gb of raw resequencing data. With the genomic sequence of the japonica cultivar Nipponbare as the reference, the following genetic features were identified in Oryza javanica: single-nucleotide polymorphisms (SNPs) ranging from 861,177 to 1,044,617, insertion-deletion polymorphisms (InDels) ranging from 164,018 to 211,135, and structural variations (SVs) ranging from 3,313 to 4,959. Variation between the two subspecies was also determined: 584,104 SNPs and 75,351 InDels were specific to Oryza indica, and 104,606 SNPs and 19,872 InDels were specific to Oryza javanica. Furthermore, Gene Ontology (GO) and KEGG analyses of Oryza javanica-specific SNP-related genes revealed that they participate in the DNA metabolic process, DNA replication, and DNA integration. Sequence variation and the candidate grain shape-related gene TGW2 were identified through Fst and selective sweep analysis. Hap4 of TGW2 performed better than the other haplotypes. The whole-genome sequence data and genetic variation information presented in this study will serve as an important gene pool for molecular breeding and will facilitate genetic analysis of Oryza javanica varieties.