BACKGROUND: A major obstacle faced by families with rare diseases is obtaining a genetic diagnosis. The average "diagnostic odyssey" lasts over five years, and causal variants are identified in fewer than 50% of cases, even when variants are captured genome-wide. Computational methods are proliferating to aid in the interpretation and prioritization of the vast number of variants detected, but it remains unclear which tools are most effective. To evaluate the performance of computational methods, and to encourage innovation in method development, we designed a Critical Assessment of Genome Interpretation (CAGI) community challenge to place variant prioritization models head-to-head in a real-life clinical diagnostic setting.
METHODS: We used genome sequencing (GS) data from families sequenced in the Rare Genomes Project (RGP), a direct-to-participant research study on the utility of GS for rare disease diagnosis and gene discovery. Challenge predictors were provided with a dataset of variant calls and phenotype terms from 175 RGP individuals (65 families), comprising 35 solved training set families with causal variants specified and 30 unlabeled test set families (14 solved, 16 unsolved). We tasked teams with identifying causal variants in as many families as possible. Predictors submitted variant predictions with estimated probability of causal relationship (EPCR) values. Model performance was determined by two metrics: a weighted score based on the rank position of causal variants, and the maximum F-measure, based on precision and recall of causal variants across all EPCR values.
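As an illustration of the second metric, the following is a minimal sketch of a maximum F-measure computed over EPCR thresholds. It assumes that every distinct EPCR value is tried as a cutoff, that a variant is "called" when its EPCR is at or above the cutoff, and that predictions are pooled across families; the challenge's exact pooling and tie-handling rules are not stated here, and the function name, the `(family_id, variant_id)` keys, and the toy data are all hypothetical. The rank-based weighted score is not sketched because its weights are not given.

```python
from typing import Dict, Set, Tuple

# Hypothetical identifier for a prediction: (family_id, variant_id).
Variant = Tuple[str, str]


def max_f_measure(predictions: Dict[Variant, float], causal: Set[Variant]) -> float:
    """Return the largest F-measure over all EPCR cutoffs.

    predictions maps each submitted variant to its EPCR value.
    causal is the set of known causal variants (the truth set).
    Assumption: a variant is called causal when EPCR >= cutoff.
    """
    if not causal:
        raise ValueError("the truth set must contain at least one causal variant")

    best = 0.0
    # Try every distinct EPCR value as a cutoff.
    for cutoff in sorted(set(predictions.values())):
        called = {v for v, epcr in predictions.items() if epcr >= cutoff}
        true_positives = len(called & causal)
        if true_positives == 0:
            continue  # precision and recall are both 0, so F is 0
        precision = true_positives / len(called)
        recall = true_positives / len(causal)
        f_measure = 2 * precision * recall / (precision + recall)
        best = max(best, f_measure)
    return best


# Toy data: two families, one causal variant each.
toy_predictions = {
    ("FAM1", "varA"): 0.9,
    ("FAM1", "varB"): 0.8,
    ("FAM2", "varC"): 0.7,
    ("FAM2", "varD"): 0.2,
}
toy_causal = {("FAM1", "varA"), ("FAM2", "varC")}
toy_fmax = max_f_measure(toy_predictions, toy_causal)
```

In the toy data, the best cutoff is 0.7: three variants are called, two of them causal, giving precision 2/3, recall 1, and an F-measure of 0.8.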
RESULTS: Sixteen teams submitted predictions from 52 models, some with manual review incorporated. Top performers recalled causal variants in up to 13 of 14 solved families within the top 5 ranked variants. Newly discovered diagnostic variants were returned to two previously unsolved families following confirmatory RNA sequencing, and two novel disease gene candidates were entered into Matchmaker Exchange. In one example, RNA sequencing demonstrated aberrant splicing due to a deep intronic indel in ASNS, identified in trans with a frameshift variant in an unsolved proband with phenotypes consistent with asparagine synthetase deficiency.
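The "top 5 ranked variants" figure can be read as a per-family top-k recall. The sketch below shows one way such a count could be computed, assuming each family's predictions are already sorted by descending EPCR. Whether a family with several causal variants (for example, a compound heterozygous pair) counts as recalled when any or only when all of them fall in the top k is not stated here, so it is exposed as the `require_all` parameter. All names and data are hypothetical.

```python
from typing import Dict, List, Set, Tuple


def top_k_recall(
    ranked_by_family: Dict[str, List[str]],
    causal_by_family: Dict[str, Set[str]],
    k: int = 5,
    require_all: bool = False,
) -> Tuple[int, int]:
    """Count solved families whose causal variant(s) appear in the top k.

    ranked_by_family maps a family ID to its variant IDs, best first.
    causal_by_family maps each solved family ID to its causal variant IDs.
    Returns (families_recalled, families_evaluated).
    """
    recalled = 0
    for family, causal in causal_by_family.items():
        # A family with no submitted predictions contributes no hit.
        top = set(ranked_by_family.get(family, [])[:k])
        hit = causal <= top if require_all else bool(causal & top)
        if hit:
            recalled += 1
    return recalled, len(causal_by_family)


# Toy data: FAM2's causal variant is ranked sixth, so it misses the top 5.
toy_ranked = {
    "FAM1": ["v1", "v2", "v3"],
    "FAM2": ["x1", "x2", "x3", "x4", "x5", "x6"],
}
toy_truth = {"FAM1": {"v2"}, "FAM2": {"x6"}}
toy_top5 = top_k_recall(toy_ranked, toy_truth, k=5)
```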
CONCLUSIONS: Model methodology and performance were highly variable. Models weighing call quality, allele frequency, predicted deleteriousness, segregation, and phenotype were effective in identifying causal variants, and models open to phenotype expansion and non-coding variants were able to capture more difficult diagnoses and discover new diagnoses. Overall, computational models can significantly aid variant prioritization. For use in diagnostics, detailed review and conservative assessment of prioritized variants against established criteria are needed.