Patient Embeddings From Diagnosis Codes for Health Care Prediction Tasks: Pat2Vec Machine Learning Framework

Abstract:

BACKGROUND: In health care, diagnosis codes in claims data and electronic health records (EHRs) play an important role in data-driven decision making. Any analysis that uses a patient's diagnosis codes to predict future outcomes or describe morbidity requires a numerical representation of the diagnosis profile, which is made up of string-based diagnosis codes. These numerical representations are especially important for machine learning models. Most commonly, binary-encoded representations have been used, usually for a subset of diagnoses. In real-world health care applications, several issues arise: patient profiles show high variability even when the underlying diseases are the same, they may have gaps and omit available information, and a large number of relevant diagnoses must be considered.
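To make the binary-encoded representation concrete, the following is a minimal sketch, assuming a toy cohort of three patients with three-character ICD-10 codes. The function names `fit_top_codes` and `binary_encode` are hypothetical and chosen for illustration only; they are not taken from the study.

```python
from collections import Counter


def fit_top_codes(profiles, k):
    """Return the k codes that occur in the most profiles (ties broken alphabetically)."""
    counts = Counter(code for profile in profiles for code in set(profile))
    ranked = sorted(counts, key=lambda code: (-counts[code], code))
    return ranked[:k]


def binary_encode(profile, vocabulary):
    """Map a list of diagnosis codes to a 0/1 vector, one position per vocabulary code."""
    present = set(profile)
    return [1 if code in present else 0 for code in vocabulary]


# Toy diagnosis profiles (illustrative data, not from the study).
profiles = [
    ["E11", "I10", "E78"],
    ["I10", "M54"],
    ["I10", "E11", "J06"],
]

vocabulary = fit_top_codes(profiles, k=3)
encoded = [binary_encode(profile, vocabulary) for profile in profiles]

print(vocabulary)
for vector in encoded:
    print(vector)
```

Because only the `k` most common codes get a position, any diagnosis outside that subset (here `M54` and `J06`) is silently dropped, which illustrates why a restricted binary encoding can lose information.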
OBJECTIVE: We herein present Pat2Vec, a self-supervised machine learning framework inspired by neural network-based natural language processing that embeds complete diagnosis profiles into a small real-valued numerical vector.
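The idea of mapping a profile of any length to a fixed, small real-valued vector can be illustrated with a deliberately simple stand-in. The sketch below averages per-code vectors; this is not the Pat2Vec training procedure, which the abstract does not specify in detail. The vectors in `code_vectors` are hand-picked toy values (in practice they would be learned), and `embed_profile` is a hypothetical name.

```python
def embed_profile(profile, code_vectors, dim):
    """Average the vectors of all known codes in a profile.

    Codes without a vector are skipped; an empty result falls back to the zero vector.
    """
    known = [code_vectors[code] for code in profile if code in code_vectors]
    if not known:
        return [0.0] * dim
    return [sum(vector[i] for vector in known) / len(known) for i in range(dim)]


# Hand-picked 2D toy vectors (an assumption for illustration only).
code_vectors = {
    "E11": [1.0, 0.0],
    "I10": [0.0, 1.0],
    "E78": [1.0, 1.0],
}

short_profile = embed_profile(["E11", "I10"], code_vectors, dim=2)
# "Z99" has no vector in this toy table and is skipped.
long_profile = embed_profile(["E11", "I10", "E78", "Z99"], code_vectors, dim=2)

print(short_profile)
print(long_profile)
```

Both profiles end up as vectors of the same length regardless of how many codes they contain, which is the property a downstream regression or classification model relies on.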
METHODS: Based on German outpatient claims data with diagnosis codes according to the International Statistical Classification of Diseases and Related Health Problems, 10th Revision (ICD-10), we identified an optimal embedding model for patient diagnosis profiles, tuning its hyperparameters with Bayesian optimization. The calibration process ensured a robust embedding model for health care-relevant tasks by aggregating the metrics of different regression and classification tasks using different machine learning algorithms (linear and logistic regression as well as gradient-boosted trees). The models were tested against a baseline model that binary-encodes the most common diagnoses. The study used diagnosis profiles and supplementary data from more than 10 million patients from 2016 to 2019 and was based on the largest German ambulatory claims data set. To describe subpopulations in health care, we identified clusters (via density-based clustering) and visualized patient vectors in 2D (via dimensionality reduction with uniform manifold approximation and projection [UMAP]). Furthermore, we applied our vectorization model to predict prospective drug prescription costs based on patients' diagnoses.
RESULTS: Our final models outperform the baseline model (binary encoding) at equal dimensionality. They are more robust to missing data and show large performance gains, particularly in lower dimensions, demonstrating the embedding model's compression of nonlinear information. In the future, other sources of health care data can be integrated into the current diagnosis-based framework. Other researchers can apply our publicly shared embedding model to their own diagnosis data.
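One way to probe robustness to missing data is to randomly remove codes from each profile and then compare model performance on the thinned and the complete profiles. The sketch below covers only the thinning step and is an assumption about how such a test could be set up; the abstract does not describe the procedure the authors used. The name `drop_codes` and the drop probability are hypothetical.

```python
import random


def drop_codes(profile, drop_prob, rng):
    """Return a copy of the profile in which each code is removed with probability drop_prob."""
    return [code for code in profile if rng.random() >= drop_prob]


rng = random.Random(0)  # fixed seed so the simulation is repeatable
profile = ["E11", "I10", "E78", "M54", "J06"]

thinned = drop_codes(profile, 0.4, rng)
print(thinned)
```

The thinned profiles can then be passed through whichever representation is under test (binary encoding or an embedding) to see how much the downstream metric degrades.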
CONCLUSIONS: We envision a wide range of applications for Pat2Vec that will improve health care quality, including personalized prevention and signal detection in patient surveillance as well as health care resource planning based on subcohorts identified by our data-driven machine learning framework.
