Keywords: Deep learning; Generative adversarial networks; Machine learning; Somatic variants; Variants annotation; Variants interpretation

MeSH: Humans; Algorithms; Machine Learning; Neoplasms / genetics; Databases, Factual; Supervised Machine Learning

Source: DOI:10.1186/s12859-023-05141-2

Abstract:
BACKGROUND: Predicting the functional consequences or clinical impacts of genetic variants in human diseases, such as cancer, remains an important challenge. An increasing number of genetic variants in cancer have been discovered and documented in public databases such as COSMIC, but the vast majority of them have no functional or clinical annotations. Some databases, such as CIViC, provide manual annotations of functional mutations, but they are small because they rely on human annotation. Since the unlabeled data (millions of variants) typically outnumber labeled data (thousands of variants), computational tools that take advantage of unlabeled data may improve prediction accuracy.
RESULTS: To leverage unlabeled data to predict the functional importance of genetic variants, we introduced a method using semi-supervised generative adversarial networks (SGAN), incorporating features from both labeled and unlabeled data. Our SGAN model incorporated features from clinical guidelines and predictive scores from other computational tools. We also performed a comparative analysis of factors that influence prediction accuracy, such as the choice of algorithm, the types of features, and the training sample size, to provide more insight into variant prioritization. We found that SGAN can achieve competitive performance with small labeled training samples by incorporating unlabeled samples, which is a unique advantage compared to traditional machine learning methods. We also found that manually curated samples can achieve more stable predictive performance than publicly available datasets.
CONCLUSIONS: By incorporating much larger samples of unlabeled data, the SGAN method can improve the ability to detect novel oncogenic variants, compared to other machine-learning algorithms that use only labeled datasets. SGAN can be potentially used to predict the pathogenicity of more complex variants such as structural variants or non-coding variants, with the availability of more training samples and informative features.
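
The abstract does not spell out how the SGAN objective is built, so the following is only a minimal sketch of one common semi-supervised GAN discriminator loss, not the authors' implementation. It assumes a discriminator that outputs K class logits (for example, hypothetical "oncogenic" and "benign" classes), with the "generated" class given an implicit logit of 0, so that the probability a sample is real is Z / (Z + 1) with Z the sum of the exponentiated logits. All function names are hypothetical. The sketch shows how labeled variants contribute a classification term while unlabeled and generated samples contribute a real-versus-fake term.

```python
import math

def log_sum_exp(logits):
    # Numerically stable log(sum(exp(logit))) over the K real-class logits.
    m = max(logits)
    return m + math.log(sum(math.exp(v - m) for v in logits))

def softplus(x):
    # Numerically stable log(1 + exp(x)).
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))

def supervised_loss(logits, label):
    # Cross-entropy over the K real classes; used for labeled variants only.
    return log_sum_exp(logits) - logits[label]

def unsupervised_loss(logits, is_real):
    # Assumption: P(real) = Z / (Z + 1) = sigmoid(log Z), with Z = sum(exp(logits)).
    lse = log_sum_exp(logits)
    if is_real:
        return softplus(lse) - lse  # -log P(real), for unlabeled real variants
    return softplus(lse)            # -log(1 - P(real)), for generated samples

def discriminator_loss(labeled, unlabeled, generated):
    # labeled: list of (logits, label); unlabeled and generated: lists of logits.
    sup = sum(supervised_loss(l, y) for l, y in labeled) / len(labeled)
    real = sum(unsupervised_loss(l, True) for l in unlabeled) / len(unlabeled)
    fake = sum(unsupervised_loss(l, False) for l in generated) / len(generated)
    return sup + real + fake

if __name__ == "__main__":
    # Toy logits for illustration only; real logits would come from a trained network.
    labeled = [([2.0, -1.0], 0), ([-0.5, 1.5], 1)]
    unlabeled = [[1.0, 0.2], [0.1, 0.3], [-0.4, 2.0]]
    generated = [[-2.0, -1.5]]
    print(discriminator_loss(labeled, unlabeled, generated))
```

Because the unlabeled term needs no class label, the millions of unannotated variants can shape the discriminator's decision boundary even when only a few thousand labeled variants feed the supervised term.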