Keywords: artificial intelligence; genetic variation analysis; machine learning; next-generation sequencing; whole-exome sequencing

Source: DOI:10.2196/37701

Abstract:
BACKGROUND: In recent years, rapid advances in next-generation sequencing (NGS) technology have made it possible to sequence an entire human genome in a short period. As a result, NGS is now widely used in clinical diagnostic practice, especially for the diagnosis of hereditary disorders. Although these approaches can generate exome data containing single-nucleotide variants (SNVs), processing a patient's DNA sequence data requires multiple tools and complex bioinformatics pipelines.
OBJECTIVE: To determine the true causal variants in a patient with a genetic disease, physicians currently often need to review numerous features of every variant manually and search the literature across different databases to understand the effect of each genetic variation. This study aims to help physicians automatically interpret the genetic variation information generated by NGS in a short period.
METHODS: We constructed a machine learning model for predicting disease-causing variants in exome data. We collected sequencing data from whole-exome sequencing (WES) and gene panels as the training set, and then integrated variant annotations from multiple genetic databases for model training. The resulting model ranks SNVs and outputs the most probable disease-causing candidates. For model testing, we collected WES data from 108 patients with rare genetic disorders at National Taiwan University Hospital. We fed the sequencing data, together with phenotypic information automatically extracted from the patients' electronic medical records by a keyword extraction tool, into our machine learning model.
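The ranking step described in the Methods can be illustrated with a minimal sketch. The abstract does not specify the model, its features, or its score scale, so the `Variant` class, the gene names, and the scores below are purely hypothetical placeholders; the sketch only shows the generic idea of sorting one patient's candidate variants by a predicted score and keeping the top of the list.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Variant:
    """One candidate variant for a patient (all fields are illustrative)."""
    gene: str      # hypothetical gene label
    change: str    # hypothetical variant description
    score: float   # assumed model output: higher means more likely causal


def rank_variants(variants, top_n=10):
    """Return the top_n variants, sorted by descending score."""
    ordered = sorted(variants, key=lambda v: v.score, reverse=True)
    return ordered[:top_n]


# Made-up candidates standing in for one patient's filtered variant list.
candidates = [
    Variant("GENE_A", "c.100A>G", 0.12),
    Variant("GENE_B", "c.200C>T", 0.91),
    Variant("GENE_C", "c.300G>A", 0.47),
]

top = rank_variants(candidates, top_n=2)
for position, variant in enumerate(top, start=1):
    print(position, variant.gene, variant.change, variant.score)
```

In practice the scores would come from whatever trained model is in use; the sorting and truncation shown here are independent of that choice.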
RESULTS: We located 92.5% (124/134) of the causative variants within the top 10 of the ranking list, out of an average of 741 candidate variants per person after filtering. AI Variant Prioritizer assigned the target gene to the top rank for 61.1% (66/108) of the patients, followed by Variant Prioritizer, which did so for 44.4% (48/108) of the patients. The cumulative rank results showed that AI Variant Prioritizer had the highest accuracy at ranks 1, 5, 10, and 20, outperforming the other tools compared. After Human Phenotype Ontology (HPO) terms were adopted by looking up the databases, the top 10 rate increased to 93.5% (101/108).
CONCLUSIONS: We successfully used WES sequencing data and free-text phenotypic information about each patient's disease, automatically extracted by the keyword extraction tool, for model training and testing. By interpreting our model, we identified which variant features are important. We also achieved a satisfactory result in finding the target variant in our testing data set. After HPO terms were adopted by looking up the databases, the top 10 rate increased to 93.5% (101/108). The performance of the model is similar to that of manual analysis, and the model has been used to support genetic diagnosis at National Taiwan University Hospital.
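The cumulative rank figures in the Results are hit rates at fixed cutoffs: the fraction of cases whose target is ranked at or above position k. A minimal sketch of that arithmetic follows. The function names and the example ranks are hypothetical and are not the study's data; only the four fractions in `reported` are taken from the abstract.

```python
def top_k_hit_rate(ranks, k):
    """Fraction of cases whose target is ranked within the top k.

    ranks: 1-based rank of the target per case, or None if it was not ranked.
    """
    if not ranks:
        raise ValueError("ranks must not be empty")
    hits = sum(1 for r in ranks if r is not None and r <= k)
    return hits / len(ranks)


def cumulative_hit_rates(ranks, cutoffs=(1, 5, 10, 20)):
    """Hit rate at each cutoff, keyed by cutoff."""
    return {k: top_k_hit_rate(ranks, k) for k in cutoffs}


# Made-up ranks for eight cases (None means the target was not ranked).
example_ranks = [1, 3, 12, None, 7, 1, 25, 2]
rates = cumulative_hit_rates(example_ranks)
print(rates)

# The percentages quoted in the abstract follow from the stated fractions.
reported = {"92.5%": (124, 134), "61.1%": (66, 108),
            "44.4%": (48, 108), "93.5%": (101, 108)}
for label, (hits, total) in reported.items():
    print(label, round(100 * hits / total, 1))
```

Because each cutoff counts every case counted by the smaller cutoffs, the rates can only stay level or rise as k grows.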