Keywords: analytics; diagnostics; electronic health records; imputation; missing data

Source: DOI:10.2147/CLEP.S436131

Abstract:
Partially observed confounder data pose challenges to the statistical analysis of electronic health records (EHR), and systematic assessments of potentially underlying missingness mechanisms are lacking. We aimed to provide a principled approach to empirically characterize missing data processes and investigate performance of analytic methods.
Three empirical sub-cohorts of diabetic SGLT2 or DPP4-inhibitor initiators with complete information on HbA1c, BMI and smoking as confounders of interest (COI) formed the basis of data simulation under a plasmode framework. A true null treatment effect, including the COI in the outcome generation model, and four missingness mechanisms for the COI were simulated: completely at random (MCAR), at random (MAR), and two not at random (MNAR) mechanisms, where missingness was dependent on an unmeasured confounder and on the value of the COI itself. We evaluated the ability of three groups of diagnostics to differentiate between mechanisms: 1) differences in characteristics between patients with or without the observed COI (using averaged standardized mean differences [ASMD]), 2) predictive ability of the missingness indicator based on observed covariates, and 3) association of the missingness indicator with the outcome. We then compared analytic methods including "complete case", inverse probability weighting, single and multiple imputation in their ability to recover true treatment effects.
The diagnostics successfully identified characteristic patterns of simulated missingness mechanisms. For MAR, but not MCAR, the patient characteristics showed substantial differences (median ASMD 0.20 vs 0.05) and consequently, discrimination of the prediction models for missingness was also higher (0.59 vs 0.50). For MNAR, but not MAR or MCAR, missingness was significantly associated with the outcome even in models adjusting for other observed covariates. Comparing analytic methods, multiple imputation using a random forest algorithm resulted in the lowest root-mean-squared-error.
Principled diagnostics provided reliable insights into missingness mechanisms. When assumptions allow, multiple imputation with nonparametric models could help reduce bias.
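
As a rough illustration of the first two diagnostics described in the abstract, the following Python sketch simulates MCAR and MAR missingness in a confounder and compares the averaged standardized mean difference (ASMD) and a rank-based c-statistic between the two mechanisms. It is not the authors' code: the covariates (age, egfr), the sample size, and the missingness coefficients are hypothetical values chosen for illustration, and a single covariate stands in for the fitted prediction model of missingness.

```python
import math
import random
import statistics


def standardized_mean_difference(a, b):
    """Absolute standardized mean difference between two groups of values."""
    pooled_sd = math.sqrt((statistics.variance(a) + statistics.variance(b)) / 2)
    return abs(statistics.mean(a) - statistics.mean(b)) / pooled_sd


def averaged_smd(rows, missing):
    """Average the SMD over all observed covariates (the columns of rows)."""
    n_cov = len(rows[0])
    smds = []
    for j in range(n_cov):
        observed = [r[j] for r, m in zip(rows, missing) if not m]
        unobserved = [r[j] for r, m in zip(rows, missing) if m]
        smds.append(standardized_mean_difference(observed, unobserved))
    return sum(smds) / n_cov


def c_statistic(scores, labels):
    """Rank-based AUC (Mann-Whitney form); assumes continuous scores without ties."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    rank_sum = sum(rank for rank, i in enumerate(order, start=1) if labels[i])
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    return (rank_sum - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)


rng = random.Random(0)
n = 5000

# Hypothetical fully observed covariates: age (years) and eGFR.
rows = [(rng.gauss(60, 10), rng.gauss(80, 15)) for _ in range(n)]
ages = [r[0] for r in rows]

# MCAR: the confounder of interest is missing with a constant probability.
missing_mcar = [rng.random() < 0.3 for _ in range(n)]

# MAR: the probability of missingness depends on an observed covariate (age).
missing_mar = [
    rng.random() < 1 / (1 + math.exp(-(-1 + 0.08 * (age - 60))))
    for age in ages
]

asmd_mcar = averaged_smd(rows, missing_mcar)
asmd_mar = averaged_smd(rows, missing_mar)

# Single-covariate stand-in for a prediction model of the missingness indicator.
auc_mcar = c_statistic(ages, missing_mcar)
auc_mar = c_statistic(ages, missing_mar)

print(f"ASMD  MCAR={asmd_mcar:.3f}  MAR={asmd_mar:.3f}")
print(f"AUC   MCAR={auc_mcar:.3f}  MAR={auc_mar:.3f}")
```

Under these assumptions, the MCAR indicator should show an ASMD near zero and a c-statistic near 0.5, while the MAR indicator should show clearly larger values on both, which is the qualitative pattern the abstract reports. Neither diagnostic can separate MAR from MNAR on its own, which is why the abstract adds the third diagnostic (association of the missingness indicator with the outcome).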