BACKGROUND: Observational biomedical studies facilitate a new strategy for large-scale electronic health record (EHR) utilization to support precision medicine. However, the inaccessibility of data labels is an increasingly important issue in clinical prediction, despite the use of synthetic data and semisupervised learning. Little research has aimed to uncover the underlying graphical structure of EHRs.
OBJECTIVE: A network-based generative adversarial semisupervised method is proposed. The objective is to train clinical prediction models on label-deficient EHRs and achieve learning performance comparable to that of supervised methods.
METHODS: Three public data sets and one colorectal cancer data set gathered from the Second Affiliated Hospital of Zhejiang University were selected as benchmarks. The proposed models were trained on 5% to 25% labeled data and evaluated on classification metrics against conventional semisupervised and supervised methods. The data quality, model security, and memory scalability were also evaluated.
RESULTS: The proposed method for semisupervised classification outperformed related semisupervised methods under the same setup, with the average area under the receiver operating characteristic curve (AUC) reaching 0.945, 0.673, 0.611, and 0.588 for the four data sets, respectively, followed by graph-based semisupervised learning (0.450, 0.454, 0.425, and 0.5676, respectively) and label propagation (0.475, 0.344, 0.440, and 0.477, respectively). The average classification AUCs with 10% labeled data were 0.929, 0.719, 0.652, and 0.650, respectively, comparable to those of the supervised learning methods: logistic regression (0.601, 0.670, 0.731, and 0.710, respectively), support vector machines (0.733, 0.720, 0.720, and 0.721, respectively), and random forests (0.982, 0.750, 0.758, and 0.740, respectively). Concerns regarding the secondary use of data and data security are alleviated by realistic data synthesis and robust privacy preservation.
CONCLUSIONS: Training clinical prediction models on label-deficient EHRs is indispensable in data-driven research. The proposed method has great potential to exploit the intrinsic structure of EHRs and achieve learning performance comparable to that of supervised methods.
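The evaluation protocol summarized under METHODS and RESULTS (keep only a small fraction of labels for training, then score predictions by AUC) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `mask_labels` and `auc`, the random seed, and the toy data are assumptions made for illustration only, and the AUC is computed with the standard pairwise-ranking definition rather than any specific library.

```python
import random


def mask_labels(labels, labeled_fraction, seed=0):
    """Return a copy of `labels` in which all but a random fraction are None.

    Hypothetical helper: simulates a label-deficient data set, e.g.
    labeled_fraction=0.10 keeps roughly 10% of the labels.
    """
    n = len(labels)
    n_keep = max(1, round(labeled_fraction * n))
    keep = set(random.Random(seed).sample(range(n), n_keep))
    return [y if i in keep else None for i, y in enumerate(labels)]


def auc(y_true, scores):
    """Area under the ROC curve for binary labels (0 or 1).

    Computed as the probability that a randomly chosen positive is scored
    higher than a randomly chosen negative; ties count as one half.
    """
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = sum(
        1.0 if p > q else 0.5 if p == q else 0.0
        for p in pos
        for q in neg
    )
    return wins / (len(pos) * len(neg))


# Toy usage: hide 90% of the labels, then score some made-up predictions.
toy_labels = [0, 1] * 10
partially_labeled = mask_labels(toy_labels, labeled_fraction=0.10)
toy_auc = auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
```

In a real experiment, a semisupervised model would be fit on the partially labeled data and `auc` would be applied to its scores on held-out records; averaging over several seeds would give figures of the kind reported as average AUCs.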