RESULTS: To leverage unlabeled data in predicting the functional importance of genetic variants, we introduce a method based on semi-supervised generative adversarial networks (SGAN) that incorporates features from both labeled and unlabeled data. Our SGAN model uses features derived from clinical guidelines together with predictive scores from other computational tools. To provide further insight into variant prioritization, we also performed a comparative analysis of factors that influence prediction accuracy, including the choice of algorithm, the types of features, and the training sample size. We found that, by incorporating unlabeled samples, SGAN achieves competitive performance even with a small labeled training set, a distinct advantage over traditional machine learning methods. We also found that manually curated samples yield more stable predictive performance than publicly available datasets.
CONCLUSIONS: By incorporating a much larger pool of unlabeled data, the SGAN method can improve the detection of novel oncogenic variants compared with machine learning algorithms that use only labeled datasets. As more training samples and informative features become available, SGAN could potentially be used to predict the pathogenicity of more complex variants, such as structural or non-coding variants.
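
To illustrate how a semi-supervised GAN can learn from labeled and unlabeled variants at once, the sketch below computes a discriminator loss in the widely used formulation of Salimans et al. (2016). This is an assumption: the abstract does not state which SGAN objective, network architecture, or feature encoding was used. The function names and the toy logits are hypothetical; in practice the logits would come from a discriminator network applied to variant feature vectors.

```python
import numpy as np


def logsumexp(logits):
    """Numerically stable log(sum(exp(logits))) over the class axis."""
    m = logits.max(axis=1, keepdims=True)
    return (m + np.log(np.exp(logits - m).sum(axis=1, keepdims=True))).ravel()


def softplus(z):
    """Stable log(1 + exp(z))."""
    return np.logaddexp(0.0, z)


def sgan_discriminator_loss(labeled_logits, labels, unlabeled_logits, generated_logits):
    """Semi-supervised GAN discriminator loss (assumed formulation).

    Each logits array has shape (n_samples, K) for K real classes.
    The probability that a sample is real is D(x) = Z / (Z + 1),
    where Z = sum_k exp(logit_k).
    """
    # Supervised term: cross-entropy on the labeled samples only.
    lse_lab = logsumexp(labeled_logits)
    true_logit = labeled_logits[np.arange(len(labels)), labels]
    supervised = -(true_logit - lse_lab).mean()

    # Unsupervised term: unlabeled real samples should look real ...
    lse_unl = logsumexp(unlabeled_logits)
    real_term = -(lse_unl - softplus(lse_unl)).mean()  # -log D(x)

    # ... and generated samples should look fake.
    lse_gen = logsumexp(generated_logits)
    fake_term = softplus(lse_gen).mean()  # -log(1 - D(G(z)))

    return supervised + real_term + fake_term


# Toy usage with random logits standing in for discriminator outputs.
rng = np.random.default_rng(0)
K = 2  # hypothetical classes, e.g. benign vs. oncogenic
toy_loss = sgan_discriminator_loss(
    labeled_logits=rng.normal(size=(8, K)),
    labels=rng.integers(0, K, size=8),
    unlabeled_logits=rng.normal(size=(64, K)),  # many more unlabeled samples
    generated_logits=rng.normal(size=(64, K)),
)
```

The unlabeled samples contribute only through the real-versus-generated terms, which is how such a model can benefit from a large unlabeled pool while the labeled set stays small.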