OBJECTIVE: We present Pat2Vec, a self-supervised machine learning framework, inspired by neural network-based natural language processing, that embeds a patient's complete diagnosis profile into a low-dimensional real-valued vector.
METHODS: Using German outpatient claims data with diagnosis codes from the International Statistical Classification of Diseases and Related Health Problems, 10th Revision (ICD-10), we identified the best-performing embedding model for patient diagnosis profiles by tuning its hyperparameters with Bayesian optimization. To make the embedding model robust across health care-relevant tasks, the calibration process aggregated the metrics of several regression and classification tasks, each evaluated with different machine learning algorithms (linear and logistic regression as well as gradient-boosted trees). The models were tested against a baseline model that binary encodes the most common diagnoses. The study used diagnosis profiles and supplementary data from more than 10 million patients from 2016 to 2019 and was based on the largest German ambulatory claims data set. To describe subpopulations in health care, we identified clusters (via density-based clustering) and visualized patient vectors in 2D (via dimensionality reduction with uniform manifold approximation and projection). Furthermore, we applied our vectorization model to predict prospective drug prescription costs based on patients' diagnoses.
RESULTS: Our final models outperform the baseline model (binary encoding) at equal dimensionality. They are more robust to missing data and show large performance gains, particularly in lower dimensions, demonstrating the embedding model's compression of nonlinear information. In the future, other sources of health care data can be integrated into the current diagnosis-based framework. Other researchers can apply our publicly shared embedding model to their own diagnosis data.
CONCLUSIONS: We envision a wide range of applications for Pat2Vec that will improve health care quality, including personalized prevention and signal detection in patient surveillance as well as health care resource planning based on subcohorts identified by our data-driven machine learning framework.
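As an illustration of the baseline described under METHODS, the following minimal sketch binary encodes diagnosis profiles over the most common codes. It is not the authors' implementation: the function names `fit_top_codes` and `binary_encode`, the toy ICD-10 profiles, and the choice to count each code once per patient are assumptions made for this example only.

```python
from collections import Counter


def fit_top_codes(profiles, k):
    """Return the k most frequent diagnosis codes across all patients.

    Assumption for this sketch: a code counts once per patient,
    however often it appears in that patient's profile.
    """
    counts = Counter(code for profile in profiles for code in set(profile))
    # Sort by descending frequency, then alphabetically, so ties are deterministic.
    ranked = sorted(counts, key=lambda code: (-counts[code], code))
    return ranked[:k]


def binary_encode(profile, vocabulary):
    """Map one diagnosis profile to a 0/1 vector over the chosen vocabulary."""
    present = set(profile)
    return [1 if code in present else 0 for code in vocabulary]


# Toy data (hypothetical patients, real ICD-10 category codes).
profiles = [
    ["I10", "E11", "M54"],
    ["I10", "J06"],
    ["I10", "E11"],
    ["M54", "F32"],
]

vocabulary = fit_top_codes(profiles, k=3)
vectors = [binary_encode(profile, vocabulary) for profile in profiles]
```

Here the vector length equals the number of retained codes, so any diagnosis outside the vocabulary (such as `F32` above) is simply dropped. A learned embedding of the same dimension is not restricted in this way, which is the comparison the abstract draws at equal dimensionality.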