variant effect prediction

  • Article type: Journal Article
    BACKGROUND: A number of deep learning models have been developed to predict epigenetic features such as chromatin accessibility from DNA sequence. Model evaluations commonly report performance genome-wide; however, cis regulatory elements (CREs), which play critical roles in gene regulation, make up only a small fraction of the genome. Furthermore, cell type-specific CREs contain a large proportion of complex disease heritability.
    RESULTS: We evaluate genomic deep learning models in chromatin accessibility regions with varying degrees of cell type specificity. We assess two modeling directions in the field: general purpose models trained across thousands of outputs (cell types and epigenetic marks) and models tailored to specific tissues and tasks. We find that the accuracy of genomic deep learning models, including two state-of-the-art general purpose models (Enformer and Sei), varies across the genome and is reduced in cell type-specific accessible regions. Using accessibility models trained on cell types from specific tissues, we find that increasing model capacity to learn cell type-specific regulatory syntax, through single-task learning or high-capacity multi-task models, can improve performance in cell type-specific accessible regions. We also observe that improving reference sequence predictions does not consistently improve variant effect predictions, indicating that novel strategies are needed to improve performance on variants.
    CONCLUSIONS: Our results provide a new perspective on the performance of genomic deep learning models, showing that performance varies across the genome and is particularly reduced in cell type-specific accessible regions. We also identify strategies to maximize performance in cell type-specific accessible regions.
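    The stratified evaluation described in this abstract can be illustrated with a minimal sketch: instead of reporting one genome-wide correlation, the same metric is computed separately within each group of regions. The stratum labels ("shared", "specific"), the toy values, and the function names below are assumptions made for illustration; they are not taken from the paper.

```python
from collections import defaultdict
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / sqrt(vx * vy)


def stratified_pearson(observed, predicted, strata):
    """Return {stratum: Pearson r}, computed separately within each stratum."""
    groups = defaultdict(lambda: ([], []))
    for obs, pred, stratum in zip(observed, predicted, strata):
        groups[stratum][0].append(obs)
        groups[stratum][1].append(pred)
    return {s: pearson(o, p) for s, (o, p) in groups.items()}


# Toy accessibility values (invented for illustration only).
observed = [1.0, 2.0, 3.0, 4.0, 1.0, 2.0, 3.0, 4.0]
predicted = [1.1, 2.1, 2.9, 4.2, 2.0, 1.0, 4.0, 3.0]
# Hypothetical labels: regions accessible in many cell types vs. in one.
strata = ["shared"] * 4 + ["specific"] * 4

overall = pearson(observed, predicted)
by_stratum = stratified_pearson(observed, predicted, strata)
print(f"overall r = {overall:.3f}")
for name, r in by_stratum.items():
    print(f"{name:>8} r = {r:.3f}")
```

    In this toy data the pooled correlation hides the gap between the two strata, which is the kind of effect a genome-wide summary can mask.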

  • Article type: Journal Article
    BACKGROUND: A number of deep learning models have been developed to predict epigenetic features such as chromatin accessibility from DNA sequence. Model evaluations commonly report performance genome-wide; however, cis regulatory elements (CREs), which play critical roles in gene regulation, make up only a small fraction of the genome. Furthermore, cell type-specific CREs contain a large proportion of complex disease heritability.
    RESULTS: We evaluate genomic deep learning models in chromatin accessibility regions with varying degrees of cell type specificity. We assess two modeling directions in the field: general purpose models trained across thousands of outputs (cell types and epigenetic marks), and models tailored to specific tissues and tasks. We find that the accuracy of genomic deep learning models, including two state-of-the-art general purpose models (Enformer and Sei), varies across the genome and is reduced in cell type-specific accessible regions. Using accessibility models trained on cell types from specific tissues, we find that increasing model capacity to learn cell type-specific regulatory syntax, through single-task learning or high-capacity multi-task models, can improve performance in cell type-specific accessible regions. We also observe that improving reference sequence predictions does not consistently improve variant effect predictions, indicating that novel strategies are needed to improve performance on variants.
    CONCLUSIONS: Our results provide a new perspective on the performance of genomic deep learning models, showing that performance varies across the genome and is particularly reduced in cell type-specific accessible regions. We also identify strategies to maximize performance in cell type-specific accessible regions.

  • Article type: Journal Article
    Critical evaluation of computational tools for predicting variant effects is important considering their increased use in disease diagnosis and driving molecular discoveries. In the sixth edition of the Critical Assessment of Genome Interpretation (CAGI) challenge, a dataset of 28 STK11 rare variants (27 missense, 1 single amino acid deletion), identified in primary non-small cell lung cancer biopsies, was experimentally assayed to characterize computational methods from four participating teams and five publicly available tools. Predictors demonstrated a high level of performance on key evaluation metrics, measuring correlation with the assay outputs and separating loss-of-function (LoF) variants from wildtype-like (WT-like) variants. The best participant model, 3Cnet, performed competitively with well-known tools. Unique to this challenge was that the functional data was generated with both biological and technical replicates, thus allowing the assessors to realistically establish maximum predictive performance based on experimental variability. Three out of the five publicly available tools and 3Cnet approached the performance of the assay replicates in separating LoF variants from WT-like variants. Surprisingly, REVEL, an often-used model, achieved a comparable correlation with the real-valued assay output as that seen for the experimental replicates. Performing variant interpretation by combining the new functional evidence with computational and population data evidence led to 16 new variants receiving a clinically actionable classification of likely pathogenic (LP) or likely benign (LB). Overall, the STK11 challenge highlights the utility of variant effect predictors in biomedical sciences and provides encouraging results for driving research in the field of computational genome interpretation.
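    The idea of using experimental replicates as a realistic upper bound on predictor performance can be sketched as follows. The abstract names the two kinds of metric (correlation with assay output, and separation of LoF from WT-like variants) but not the exact statistics, so the use of Spearman correlation and AUROC here is an assumption, as are all values and variable names.

```python
def average_ranks(values):
    """Ranks starting at 1; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    return cov / (vx * vy) ** 0.5


def auroc(scores, is_positive):
    """Chance that a random positive outscores a random negative (ties = 0.5)."""
    pos = [s for s, p in zip(scores, is_positive) if p]
    neg = [s for s, p in zip(scores, is_positive) if not p]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy data, invented for illustration: assay activity per variant in two
# replicates (low activity = loss of function) and a predictor's damage score.
rep1 = [0.10, 0.20, 0.90, 1.00, 0.80, 0.15]
rep2 = [0.12, 0.22, 0.95, 0.97, 0.85, 0.18]
pred_damage = [0.9, 0.7, 0.2, 0.3, 0.75, 0.8]
is_lof = [True, True, False, False, False, True]

# Replicate-vs-replicate agreement acts as the practical ceiling.
ceiling_rho = spearman(rep1, rep2)
# Negate activity so that higher always means "more damaging".
model_rho = spearman(pred_damage, [-a for a in rep1])
ceiling_auc = auroc([-a for a in rep2], is_lof)
model_auc = auroc(pred_damage, is_lof)
print(f"Spearman: model {model_rho:.3f} vs replicate ceiling {ceiling_rho:.3f}")
print(f"AUROC:    model {model_auc:.3f} vs replicate ceiling {ceiling_auc:.3f}")
```

    A predictor whose metrics approach the replicate-derived values is performing about as well as the assay noise allows, which is how the abstract frames the comparison.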

  • Article type: Journal Article
    Phenylketonuria (PKU) is a genetic disorder caused by variations in the phenylalanine hydroxylase (PAH) gene. Among the 3369 reported PAH variants, 33.7% are missense alterations. Unfortunately, 30% of these missense variants are classified as variants of unknown significance (VUS), posing challenges for genetic risk assessment. In our study, we focused on analyzing 836 missense PAH variants following the American College of Medical Genetics and Genomics/Association for Molecular Pathology (ACMG/AMP) guidelines as specified by the ClinGen PAH Variant Curation Expert Panel (VCEP) criteria. We utilized and compared variant annotator tools such as Franklin and Varsome, conducted 3D structural analysis of PAH, and examined active and regulatory site hotspots. In addition, we assessed the potential splicing effects of apparent missense variants. By evaluating phenotype data from 22962 PKU patients, our aim was to reassess the pathogenicity of missense variants. Our comprehensive approach successfully reclassified 309 VUSs out of 836 missense variants as likely pathogenic or pathogenic (37%), upgraded 370 likely pathogenic variants to pathogenic, and reclassified one previously considered likely benign variant as likely pathogenic. Phenotypic information was available for 636 missense variants, with 441 undergoing 3D structural analysis and active site hotspot identification for 180 variants. After our analysis, only 6% of missense variants were classified as VUSs, and three of them (c.23A>C/p.Asn8Thr, c.59_60delinsCC/p.Gln20Pro, and c.278A>T/p.Asn93Ile) may be influenced by abnormal splicing. Moreover, a pathogenic variant (c.168G>T/p.Glu56Asp) was identified to have a risk exceeding 98% for modifications of the consensus splice site, with high scores indicating a donor loss of 0.94. The integration of ACMG/AMP guidelines with in silico structural analysis and phenotypic data significantly reduced the number of missense VUSs, providing a strong basis for genetic counseling and emphasizing the importance of metabolic phenotype information in variant curation. This study also sheds light on the current landscape of PAH variants.

  • Article type: Journal Article
    Continued advances in variant effect prediction are necessary to demonstrate the ability of machine learning methods to accurately determine the clinical impact of variants of unknown significance (VUS). Towards this goal, the ARSA Critical Assessment of Genome Interpretation (CAGI) challenge was designed to characterize progress by utilizing 219 experimentally assayed missense VUS in the Arylsulfatase A (ARSA) gene to assess the performance of community-submitted predictions of variant functional effects. The challenge involved 15 teams and evaluated additional predictions from established and recently released models. Notably, a model developed by participants of a genetics and coding bootcamp, trained with standard machine-learning tools in Python, demonstrated superior performance among submissions. Furthermore, the study observed that state-of-the-art deep learning methods provided a small but statistically significant improvement in predictive performance compared to less elaborate techniques. These findings underscore the utility of variant effect prediction and the potential for models trained with modest resources to accurately classify VUS in genetic and clinical research.

  • Article type: Journal Article
    Genome-wide association studies (GWASs) provide a key foundation for elucidating the genetic underpinnings of common polygenic diseases. However, these studies have limitations in their ability to assign causality to particular genetic variants, especially those residing in the noncoding genome. Over the past decade, technological and methodological advances in both analytical and empirical prioritization of noncoding variants have enabled the identification of causative variants by leveraging orthogonal functional evidence at increasing scale. In this review, we present an overview of these approaches and describe how this workflow provides the groundwork necessary to move beyond associations toward genetically informed studies on the molecular and cellular mechanisms of polygenic disease.

  • Article type: Journal Article
    The last five years have seen impressive progress in deep learning models applied to protein research. Most notably, sequence-based structure predictions have seen transformative gains in the form of AlphaFold2 and related approaches. Millions of missense protein variants in the human population lack annotations, and these computational methods are a valuable means to prioritize variants for further analysis. Here, we review the recent progress in deep learning models applied to the prediction of protein structure and protein variants, with particular emphasis on their implications for human genetics and health. Improved prediction of protein structures facilitates annotations of the impact of variants on protein stability, protein-protein interaction interfaces, and small-molecule binding pockets. Moreover, it contributes to the study of host-pathogen interactions and the characterization of protein function. As genome sequencing in large cohorts becomes increasingly prevalent, we believe that better integration of state-of-the-art protein informatics technologies into human genetics research is of paramount importance.

  • Article type: Journal Article
    The technological advances of sequencing methods during the past 20 years have fuelled the generation of large amounts of sequencing data that comprise common variations, as well as millions of rare and personal variants that would not be identified by conventional genotyping. While comprehensive sequencing is technically feasible, its clinical utility for guiding personalized treatment decisions remains controversial.
    We discuss the opportunities and challenges of comprehensive sequencing compared to targeted genotyping for pharmacogenomic applications. Current pharmacogenomic sequencing panels are heterogeneous, and clinical actionability of the included genes is not a major focus. We provide a current overview and critical discussion of how current studies utilize sequencing data either retrospectively from biobanks, databases or repurposed diagnostic sequencing, or prospectively using pharmacogenomic sequencing.
    While sequencing-based pharmacogenomics has provided important insights into genetic variations underlying the safety and efficacy of a multitude of pharmacological treatments, important hurdles for the clinical implementation of pharmacogenomic sequencing remain. We identify gaps in the interpretation of pharmacogenetic variants, technical challenges pertaining to complex loci and variant phasing, as well as unclear cost-effectiveness and incomplete reimbursement. It is critical to address these challenges in order to realize the promising prospects of pharmacogenomic sequencing.

  • Article type: Journal Article
    The expanding catalog of genome-wide association studies (GWAS) provides biological insights across a variety of species, but identifying the causal variants behind these associations remains a significant challenge. Experimental validation is both labor-intensive and costly, highlighting the need for accurate, scalable computational methods to predict the effects of genetic variants across the entire genome. Inspired by recent progress in natural language processing, unsupervised pretraining on large protein sequence databases has proven successful in extracting complex information related to proteins. These models showcase their ability to learn variant effects in coding regions using an unsupervised approach. Expanding on this idea, we here introduce the Genomic Pre-trained Network (GPN), a model designed to learn genome-wide variant effects through unsupervised pretraining on genomic DNA sequences. Our model also successfully learns gene structure and DNA motifs without any supervision. To demonstrate its utility, we train GPN on unaligned reference genomes of Arabidopsis thaliana and seven related species within the Brassicales order and evaluate its ability to predict the functional impact of genetic variants in A. thaliana by utilizing allele frequencies from the 1001 Genomes Project and a comprehensive database of GWAS. Notably, GPN outperforms predictors based on popular conservation scores such as phyloP and phastCons. Our predictions for A. thaliana can be visualized as sequence logos in the UCSC Genome Browser (https://genome.ucsc.edu/s/gbenegas/gpn-arabidopsis). We provide code (https://github.com/songlab-cal/gpn) to train GPN for any given species using its DNA sequence alone, enabling unsupervised prediction of variant effects across the entire genome.
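    One common way to turn an unsupervised sequence model into a variant effect score is a log-likelihood ratio: mask the variant position, ask the model for a probability over the four nucleotides, and compare the alternate allele with the reference allele. The abstract does not state the scoring rule, so treating it as a log-likelihood ratio is an assumption here, and `toy_masked_model` is a hypothetical stand-in (simple context base frequencies) rather than the GPN code or API.

```python
from math import log

NUCLEOTIDES = "ACGT"


def toy_masked_model(sequence, position):
    """Hypothetical stand-in for a pretrained masked DNA language model.

    Returns a probability for each nucleotide at `position`, estimated from
    base frequencies in the rest of the sequence plus a pseudocount of 1.
    A real model would use learned sequence context instead.
    """
    context = sequence[:position] + sequence[position + 1:]
    counts = {nt: 1 + context.count(nt) for nt in NUCLEOTIDES}
    total = sum(counts.values())
    return {nt: c / total for nt, c in counts.items()}


def score_variant(sequence, position, alt, model=toy_masked_model):
    """Log-likelihood ratio log P(alt) - log P(ref) at the masked position.

    Negative scores mean the model finds the alternate allele less plausible
    than the reference allele in this context.
    """
    ref = sequence[position]
    probs = model(sequence, position)
    return log(probs[alt]) - log(probs[ref])


# Toy example: an A-rich sequence with a single G at index 4.
seq = "AAAAGAAAA"
a_to_c = score_variant(seq, 0, "C")  # replaces a context-typical A
g_to_a = score_variant(seq, 4, "A")  # replaces the atypical G
print(f"A>C at 0: {a_to_c:.3f}")
print(f"G>A at 4: {g_to_a:.3f}")
```

    Because the score needs only the model's predicted distribution at one masked position, it requires no labeled variants, which matches the unsupervised setting the abstract describes.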

  • Article type: Journal Article
    Interpretation of disease-causing genetic variants remains a challenge in human genetics. Current costs and complexity of deep mutational scanning methods hamper crowd-sourcing approaches toward genome-wide resolution of variants in disease-related genes. Our framework, Saturation Mutagenesis-Reinforced Functional assays (SMuRF), addresses these issues by offering simple and cost-effective saturation mutagenesis, as well as streamlining functional assays to enhance the interpretation of unresolved variants. Applying SMuRF to neuromuscular disease genes FKRP and LARGE1, we generated functional scores for all possible coding single nucleotide variants, which aid in resolving clinically reported variants of uncertain significance. SMuRF also demonstrates utility in predicting disease severity, resolving critical structural regions, and providing training datasets for the development of computational predictors. Our approach opens new directions for enabling variant-to-function insights for disease genes in a manner that is broadly useful for crowd-sourcing implementation across standard research laboratories.