Keywords: Electronic Health Records; Embedding; Machine Learning; Missing Data

MeSH: Electronic Health Records; Humans; Clinical Deterioration; Models, Statistical; Clinical Laboratory Techniques

Source: DOI:10.1186/s12911-024-02612-1

Abstract:
BACKGROUND: Electronic Health Records (EHR) are widely used to develop clinical prediction models (CPMs). One challenge is that EHR data are often missing in an informative way. For example, laboratory measures are typically ordered only when a clinician has a specific concern, so the absence of a result itself carries information. When data are Not Missing at Random (NMAR), analytic strategies that assume other missingness mechanisms are inappropriate. In this work, we compare the impact of different strategies for handling missing data on CPM performance.
METHODS: We used a predictive model for rapid inpatient deterioration as an exemplar implementation. The model incorporated twelve laboratory measures with varying levels of missingness: five labs had missingness rates of around 50%, and the other seven had rates of around 90%. We included them because we believed their missingness status could be highly informative for the prediction. We explicitly compared the following missing data strategies: mean imputation, normal-value imputation, conditional imputation, categorical encoding, and missingness embeddings. Some of these were also combined with last observation carried forward (LOCF). We implemented logistic LASSO regression, multilayer perceptron (MLP), and long short-term memory (LSTM) models as the downstream classifiers. We compared AUROC on the test data and used bootstrapping to construct 95% confidence intervals.
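As a rough illustration of several strategies named above, the following sketch applies mean imputation, normal-value imputation, categorical encoding, a missingness indicator, and LOCF to a toy table. It is not the authors' implementation: the column names, the reference value `NORMAL_LACTATE`, and the bin cut-point are hypothetical, and conditional imputation and learned missingness embeddings are not shown because the abstract does not specify them.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: two encounters with hourly rows and one lab value.
df = pd.DataFrame({
    "encounter_id": [1, 1, 1, 2, 2],
    "hour": [0, 1, 2, 0, 1],
    "lactate": [np.nan, 3.0, np.nan, np.nan, 1.0],
})
NORMAL_LACTATE = 1.0  # assumed "normal" reference value, for illustration only

# Mean imputation: fill gaps with the mean of the observed values.
# (In practice the mean would be estimated on training data only.)
df["lactate_mean"] = df["lactate"].fillna(df["lactate"].mean())

# Normal-value imputation: assume an unmeasured lab would have been normal.
df["lactate_normal"] = df["lactate"].fillna(NORMAL_LACTATE)

# Categorical encoding: bin observed values and give "missing" its own level,
# so a classifier can learn from the missingness itself.
binned = pd.cut(df["lactate"], bins=[-np.inf, 2.0, np.inf],
                labels=["normal", "high"])
df["lactate_cat"] = binned.cat.add_categories("missing").fillna("missing")

# Missingness indicator: 1 where the lab was not measured.
df["lactate_missing"] = df["lactate"].isna().astype(int)

# LOCF within each encounter, then normal-value imputation for rows
# that have no earlier observation to carry forward.
locf = df.groupby("encounter_id")["lactate"].ffill()
df["lactate_locf_normal"] = locf.fillna(NORMAL_LACTATE)

print(df)
```

Keeping the indicator or the explicit "missing" category alongside the filled value is what lets a downstream model use the missingness pattern rather than only the imputed number.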
RESULTS: The dataset comprised 105,198 inpatient encounters, 4.7% of which experienced the deterioration outcome of interest. LSTM models generally outperformed the cross-sectional models, with embedding approaches and categorical encoding yielding the best results. For the cross-sectional models, normal-value imputation with LOCF generated the best results.
CONCLUSIONS: Strategies that accounted for the possibility of NMAR missing data yielded better model performance than those that did not. The embedding method had the advantage of not requiring prior clinical knowledge. Using LOCF could enhance the performance of cross-sectional models but had adverse effects in LSTM models.
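The evaluation described in METHODS (test-set AUROC with bootstrapped 95% confidence intervals) can be sketched as follows. This is a minimal example under stated assumptions: the abstract does not say which bootstrap variant was used, so a percentile bootstrap is assumed here, and the function names `auroc` and `bootstrap_auroc_ci` and the synthetic data are hypothetical.

```python
import numpy as np

def auroc(y_true, y_score):
    """AUROC as the probability that a positive outscores a negative (ties count 0.5)."""
    y_true = np.asarray(y_true)
    y_score = np.asarray(y_score, dtype=float)
    pos = y_score[y_true == 1]
    neg = y_score[y_true == 0]
    diff = pos[:, None] - neg[None, :]  # all positive/negative score pairs
    return float(np.mean(diff > 0) + 0.5 * np.mean(diff == 0))

def bootstrap_auroc_ci(y_true, y_score, n_boot=1000, alpha=0.05, seed=0):
    """Percentile-bootstrap confidence interval for AUROC (assumed variant)."""
    rng = np.random.default_rng(seed)
    y_true = np.asarray(y_true)
    y_score = np.asarray(y_score, dtype=float)
    n = len(y_true)
    stats = []
    while len(stats) < n_boot:
        idx = rng.integers(0, n, size=n)  # resample encounters with replacement
        if y_true[idx].min() == y_true[idx].max():
            continue  # skip resamples containing only one class
        stats.append(auroc(y_true[idx], y_score[idx]))
    lo, hi = np.quantile(stats, [alpha / 2, 1 - alpha / 2])
    return auroc(y_true, y_score), float(lo), float(hi)

# Synthetic labels and scores, for illustration only.
rng = np.random.default_rng(42)
y = rng.binomial(1, 0.3, size=200)
scores = 0.5 * y + rng.normal(size=200)
point, lo, hi = bootstrap_auroc_ci(y, scores)
print(f"AUROC = {point:.3f}, 95% CI = ({lo:.3f}, {hi:.3f})")
```

With an outcome as rare as the 4.7% reported here, resamples that contain a single class are unlikely at this sample size, but the guard keeps the sketch safe on small test sets.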