Missing data

  • Article type: Journal Article
    Despite the many advances of the genomic era, there is a persistent problem in assessing the uncertainty of phylogenomic hypotheses. We see this in the recent history of phylogenetics for cockroaches and termites (Blattodea), where huge advances have been made, but there are still major inconsistencies between studies. To address this, we present a phylogenetic analysis of Blattodea that emphasizes identification and quantification of uncertainty. We analyze 1183 gene domains using three methods (multi-species coalescent inference, concatenation, and a supermatrix-supertree hybrid approach) and assess support for controversial relationships while considering data quality. The hybrid approach, here dubbed "tiered phylogenetic inference", incorporates information about data quality into an incremental tree-building framework. Leveraging this method, we are able to identify cases of low or misleading support that could not be identified otherwise, and to explore them more thoroughly with follow-up tests. In particular, quality annotations pointed towards nodes with high bootstrap support that later turned out to have large ambiguities, sometimes resulting from low-quality data. We also clarify issues related to some recalcitrant nodes: Anaplectidae's placement lacks unbiased signal, Ectobiidae s.s. and Anaplectoideini need greater taxon sampling, and the deepest relationships among most Blaberidae lack signal. As a result, several previous phylogenetic uncertainties are now closer to being resolved (e.g., African and Malagasy "Rhabdoblatta" spp. are the sister to all other Blaberidae, and Oxyhaloinae is sister to the remaining Blaberidae). Overall, we argue for more approaches to quantifying support that take data quality into account to uncover the nature of recalcitrant nodes.

  • Article type: Journal Article
    To describe the prevalence of missing sociodemographic data in the IRIS® (Intelligent Research in Sight) Registry and to identify practice-level characteristics associated with missing sociodemographic data.
    Cross-sectional study.
    All patients with clinical encounters at practices participating in the IRIS Registry prior to December 31, 2020.
    We describe geographic and temporal trends in the prevalence of missing data for each sociodemographic variable (age, sex, race, ethnicity, geographic location, insurance type, and smoking status). Each practice contributing data to the registry was categorized based on the number of patients, number of physicians, geographic location, patient visit frequency, and patient population demographics.
    Multivariable linear regression was used to describe the association of practice-level characteristics with missing patient-level sociodemographic data.
    This study included the electronic health records of 66 477 365 patients receiving care at 3306 practices participating in the IRIS Registry. The median number of patients per practice was 11 415 (interquartile range: 5849-24 148) and the median number of physicians per practice was 3 (interquartile range: 1-7). The prevalence of missing patient sociodemographic data was 0.1% for birth year, 0.4% for sex, 24.8% for race, 30.2% for ethnicity, 2.3% for 3-digit zip code, 14.8% for state, 5.5% for smoking status, and 17.0% for insurance type. The prevalence of missing data increased over time and varied at the state level. Missing race data were associated with practices that had fewer visits per patient (P < 0.001), cared for a larger nonprivately insured patient population (P = 0.001), and were located in urban areas (P < 0.001). Frequent patient visits were associated with a lower prevalence of missing race (P < 0.001), ethnicity (P < 0.001), and insurance (P < 0.001), but a higher prevalence of missing smoking status (P < 0.001).
    There are geographic and temporal trends in missing race, ethnicity, and insurance type data in the IRIS Registry. Several practice-level characteristics, including practice size, geographic location, and patient population, are associated with missing sociodemographic data. While the prevalence and patterns of missing data may change in future versions of the IRIS Registry, there will remain a need to develop standardized approaches for minimizing potential sources of bias and ensuring reproducibility across research studies.
    Proprietary or commercial disclosure may be found in the Footnotes and Disclosures at the end of this article.

  • Article type: Journal Article
    OBJECTIVE: Scales often arise from multi-item questionnaires, yet commonly face item non-response. Traditional solutions use weighted mean (WMean) from available responses, but potentially overlook missing data intricacies. Advanced methods like multiple imputation (MI) address broader missing data, but demand increased computational resources. Researchers frequently use survey data in the All of Us Research Program (All of Us), and it is imperative to determine if the increased computational burden of employing MI to handle non-response is justifiable.
    OBJECTIVE: Using the 5-item Physical Activity Neighborhood Environment Scale (PANES) in All of Us, this study assessed the tradeoff between efficacy and computational demands of WMean, MI, and inverse probability weighting (IPW) when dealing with item non-response.
    METHODS: Synthetic missingness, allowing non-response on 1 or more items, was introduced into PANES across 3 missingness mechanisms and various missing percentages (10%-50%). Each scenario compared WMean of complete questions, MI, and IPW on bias, variability, coverage probability, and computation time.
    RESULTS: All methods showed minimal biases (all <5.5%) under good internal consistency, with WMean suffering most under poor consistency. IPW showed considerable variability with increasing missing percentage. MI required significantly more computational resources, taking >8000 and >100 times longer than WMean and IPW in full data analysis, respectively.
    CONCLUSIONS: The marginal performance advantages of MI for item non-response in highly reliable scales do not warrant its escalated cloud computational burden in All of Us, particularly when coupled with computationally demanding post-imputation analyses. Researchers using survey scales with low missingness could utilize WMean to reduce computing burden.
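    The WMean approach described in this abstract can be illustrated with a minimal sketch. It assumes equal item weights and an unanswered item coded as None; the function name wmean_score, the min_answered cutoff, and the toy responses are hypothetical and are not taken from the study.

```python
from typing import List, Optional, Sequence


def wmean_score(items: Sequence[Optional[float]],
                min_answered: int = 1) -> Optional[float]:
    """Mean of the answered items of one respondent.

    Unanswered items are coded as None and simply skipped.
    Returns None when fewer than `min_answered` items were answered.
    """
    answered = [x for x in items if x is not None]
    if len(answered) < min_answered:
        return None
    return sum(answered) / len(answered)


# Hypothetical responses to a 5-item scale (values are illustrative only).
responses: List[List[Optional[float]]] = [
    [4, 3, None, 2, 3],              # one item skipped
    [None, None, None, None, None],  # nothing answered
    [1, 2, 3, 4, 5],                 # complete
]
scores = [wmean_score(r) for r in responses]
print(scores)
```

    Because each respondent is scored in a single pass over at most five values, the cost is negligible next to fitting and pooling several imputation models, which is the tradeoff the abstract weighs.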

  • Article type: Journal Article
    Ranked set sampling (RSS) is known to increase the efficiency of estimators compared with simple random sampling. The problem of missingness creates a gap in the information that needs to be addressed before proceeding to estimation. Little work has been carried out to deal with missingness under RSS. This paper proposes some logarithmic-type methods of imputation for the estimation of the population mean under RSS using auxiliary information. The properties of the suggested imputation procedures are examined. A simulation study is conducted to show that the proposed imputation procedures exhibit better results than some of the existing imputation procedures. A few real applications of the proposed imputation procedures are also provided to generalize the simulation study.

  • Article type: Journal Article
    Multivariable Mendelian randomization allows simultaneous estimation of direct causal effects of multiple exposure variables on an outcome. When the exposure variables of interest are quantitative omic features, obtaining complete data can be economically and technically challenging: the measurement cost is high, and the measurement devices may have inherent detection limits. In this paper, we propose a valid and efficient method to handle unmeasured and undetectable values of the exposure variables in a one-sample multivariable Mendelian randomization analysis with individual-level data. We estimate the direct causal effects with maximum likelihood estimation and develop an expectation-maximization algorithm to compute the estimators. We show the advantages of the proposed method through simulation studies and provide an application to the Hispanic Community Health Study/Study of Latinos, which has a large amount of unmeasured exposure data.

  • Article type: Journal Article
    Body Mass Index (BMI) trajectories are important for understanding how BMI develops over time. Missing data is often stated as a limitation in studies that analyse BMI over time and there is limited research exploring how missing data influences BMI trajectories. This study explores the influence missing data has in estimating BMI trajectories and the impact on subsequent analysis. This study uses data from the English Longitudinal Study of Ageing. Distinct BMI trajectories are estimated for adults aged 50 years and over. Next, multiple methods accounting for missing data are implemented and compared. Estimated trajectories are then used to predict the risk of developing type 2 diabetes mellitus (T2DM). Four distinct trajectories are identified using each of the missing data methods: stable overweight, elevated BMI, increasing BMI, and decreasing BMI. However, the likelihoods of individuals following the different trajectories differ between the different methods. The influence of BMI trajectory on T2DM is reduced after accounting for missing data. More work is needed to understand which methods for missing data are most reliable. When estimating BMI trajectories, missing data should be considered. The extent to which accounting for missing data influences cost-effectiveness analyses should be investigated.

  • Article type: Journal Article
    BACKGROUND: The impact of missing data on individual continuous glucose monitoring (CGM) data is unknown but can influence clinical decision-making for patients.
    OBJECTIVE: We aimed to investigate the consequences of data loss on glucose metrics in individual patient recordings from continuous glucose monitors and assess its implications on clinical decision-making.
    METHODS: The CGM data were collected from patients with type 1 and 2 diabetes using the FreeStyle Libre sensor (Abbott Diabetes Care). From each individual patient, we selected 7-28 days of continuous 24-hour data without any missing values. To mimic real-world data loss, missing data ranging from 5% to 50% were introduced into the data set. Clinical metrics including time below range (TBR), TBR level 2 (TBR2), and other common glucose metrics were calculated in the data sets with and without data loss. Recordings in which glucose metrics deviated relevantly due to data loss, as determined by clinical experts, were defined as expert panel boundary error (εEPB). These errors were expressed as a percentage of the total number of recordings. The errors for the recordings with glucose management indicator <53 mmol/mol were investigated.
    RESULTS: A total of 84 patients contributed to 798 recordings over 28 days. With 5%-50% data loss for 7- to 28-day recordings, the εEPB varied from 0 out of 798 (0.0%) to 147 out of 736 (20.0%) recordings for TBR and from 0 out of 612 (0.0%) to 22 out of 408 (5.4%) recordings for TBR2. In the case of 14-day recordings, TBR and TBR2 episodes completely disappeared due to 30% data loss in 2 out of 786 (0.3%) and 32 out of 522 (6.1%) of the cases, respectively. However, the initial values of the disappeared TBR and TBR2 were relatively small (<0.1%). In the recordings with glucose management indicator <53 mmol/mol, the εEPB was 9.6% for 14 days with 30% data loss.
    CONCLUSIONS: With a maximum of 30% data loss in 14-day CGM recordings, there is minimal impact of missing data on the clinical interpretation of various glucose metrics.
    TRIAL REGISTRATION: ClinicalTrials.gov NCT05584293; https://clinicaltrials.gov/study/NCT05584293.
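    The kind of experiment this abstract describes (remove part of a complete recording, then recompute a glucose metric) can be sketched as follows. This is an illustration under stated assumptions, not the study's procedure: readings are dropped uniformly at random, whereas the abstract does not say how the loss was patterned; the thresholds of 3.9 and 3.0 mmol/L for TBR and TBR2 are the commonly used consensus cutoffs rather than values quoted in the abstract; and the function names and the toy trace are hypothetical.

```python
import random
from typing import List


def time_below_range(glucose: List[float], threshold: float = 3.9) -> float:
    """Percentage of readings strictly below `threshold` (mmol/L)."""
    if not glucose:
        raise ValueError("no readings")
    return 100.0 * sum(g < threshold for g in glucose) / len(glucose)


def drop_readings(glucose: List[float], loss_fraction: float,
                  seed: int = 0) -> List[float]:
    """Remove a random fraction of readings, keeping the rest in order.

    Assumption: loss is uniform at random; real sensor gaps tend to
    occur in contiguous runs.
    """
    rng = random.Random(seed)
    n_drop = round(loss_fraction * len(glucose))
    dropped = set(rng.sample(range(len(glucose)), n_drop))
    return [g for i, g in enumerate(glucose) if i not in dropped]


# Hypothetical trace of 96 readings (values are illustrative only).
trace = [6.0] * 90 + [3.5] * 4 + [2.8] * 2

full_tbr = time_below_range(trace)                   # threshold 3.9 mmol/L
full_tbr2 = time_below_range(trace, threshold=3.0)   # level 2 threshold

lossy = drop_readings(trace, loss_fraction=0.30)
lossy_tbr = time_below_range(lossy)
print(full_tbr, full_tbr2, len(lossy), lossy_tbr)
```

    Repeating the drop with many seeds and comparing lossy_tbr against full_tbr would give a rough sense of how often a short low-glucose episode shrinks or vanishes at a given loss fraction.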

  • Article type: Journal Article
    We consider the problem of optimal model averaging for partially linear models when the responses are missing at random and some covariates are measured with error. A novel weight choice criterion based on the Mallows-type criterion is proposed for the weight vector to be used in the model averaging. The resulting model averaging estimator for the partially linear models is shown to be asymptotically optimal under some regularity conditions in terms of achieving the smallest possible squared loss. In addition, the existence of a local minimizing weight vector and its convergence rate to the risk-based optimal weight vector are established. Simulation studies suggest that the proposed model averaging method generally outperforms existing methods. As an illustration, the proposed method is applied to analyze an HIV-CD4 dataset.

  • Article type: Journal Article
    BACKGROUND: Vital signs are important factors in assessing injury severity and guiding trauma resuscitation, especially among severely injured patients. Despite this, physiological data are frequently missing from trauma registries. This study aimed to evaluate the extent of missing prehospital data in a hospital-based trauma registry and to assess the associations between prehospital physiological data completeness and indicators of injury severity.
    METHODS: A retrospective review was conducted on all adult trauma patients brought directly to a level 1 trauma center in Toronto, Ontario by paramedics from January 1, 2015, to December 31, 2019. The proportion of missing data was evaluated for each variable and patterns of missingness were assessed. To investigate the associations between prehospital data completeness and injury severity factors, descriptive and unadjusted logistic regression analyses were performed.
    RESULTS: A total of 3,528 patients were included. We considered prehospital data missing if any of heart rate, systolic blood pressure, respiratory rate or oxygen saturation were incomplete. Each individual variable was missing from the registry in approximately 20% of patients, with oxygen saturation missing most frequently (n = 831; 23.6%). Over 25% (n = 909) of patients were missing at least one prehospital vital sign, of which 69.1% (n = 628) were missing all four of these variables. Patients with incomplete data were more severely injured, had higher mortality, and more frequently received lifesaving interventions such as blood transfusion and intubation. Patients were most likely to have missing prehospital physiological data if they died in the trauma bay (unadjusted OR: 9.79; 95% CI: 6.35-15.10), did not survive to discharge (unadjusted OR: 3.55; 95% CI: 2.76-4.55), or had a prehospital GCS less than 9 (OR: 3.24; 95% CI: 2.59-4.06).
    CONCLUSIONS: In this single-center trauma registry, key prehospital variables were frequently missing, particularly among more severely injured patients. Patients with missing data had higher mortality, more severe injury characteristics and received more life-saving interventions in the trauma bay, suggesting an injury severity bias in prehospital vital sign missingness. To ensure the validity of research based on trauma registry data, patterns of missingness must be carefully considered to ensure missing data is appropriately addressed.

  • Article type: Journal Article
    BACKGROUND: Electronic Health Records (EHR) are widely used to develop clinical prediction models (CPMs). However, one of the challenges is that there is often a degree of informative missing data. For example, laboratory measures are typically taken when a clinician is concerned that there is a need. When data are Not Missing at Random (NMAR), analytic strategies based on other missingness mechanisms are inappropriate. In this work, we seek to compare the impact of different strategies for handling missing data on CPM performance.
    METHODS: We considered a predictive model for rapid inpatient deterioration as an exemplar implementation. This model incorporated twelve laboratory measures with varying levels of missingness. Five labs had missingness rates around 50%, and the other seven had missingness rates around 90%. We included them based on the belief that their missingness status can be highly informative for the prediction. In our study, we explicitly compared the various missing data strategies: mean imputation, normal-value imputation, conditional imputation, categorical encoding, and missingness embeddings. Some of these were also combined with the last observation carried forward (LOCF). We implemented logistic LASSO regression, multilayer perceptron (MLP), and long short-term memory (LSTM) models as the downstream classifiers. We compared the AUROC on testing data and used bootstrapping to construct 95% confidence intervals.
    RESULTS: We had 105,198 inpatient encounters, with 4.7% having experienced the deterioration outcome of interest. LSTM models generally outperformed other cross-sectional models, where embedding approaches and categorical encoding yielded the best results. For the cross-sectional models, normal-value imputation with LOCF generated the best results.
    CONCLUSIONS: Strategies that accounted for the possibility of NMAR missing data yielded better model performance than those that did not. The embedding method had an advantage as it did not require prior clinical knowledge. Using LOCF could enhance the performance of cross-sectional models but could be counterproductive in LSTM models.
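    Two of the strategies named in this abstract, normal-value imputation and LOCF, can be combined with an explicit missingness flag as in the sketch below. It is an illustration only: the binary flag is a simple stand-in for the categorical encoding and learned embeddings the abstract mentions, and the function name, the normal value of 1.0, and the toy lab series are hypothetical.

```python
from typing import List, Optional, Tuple


def locf_with_indicator(series: List[Optional[float]],
                        normal_value: float) -> List[Tuple[float, int]]:
    """Return one (value, missing_flag) pair per time step.

    A missing entry (None) is filled with the last observed value (LOCF);
    if nothing has been observed yet, it is filled with `normal_value`.
    The flag is 1 where the original entry was missing, so a downstream
    model can still use the fact that the lab was not ordered.
    """
    filled: List[Tuple[float, int]] = []
    last: Optional[float] = None
    for x in series:
        if x is None:
            filled.append((last if last is not None else normal_value, 1))
        else:
            last = x
            filled.append((x, 0))
    return filled


# Hypothetical lab series over five time steps (values are illustrative).
lab = [None, 2.4, None, None, 1.1]
features = locf_with_indicator(lab, normal_value=1.0)
print(features)
```

    Keeping the flag alongside the filled value is what lets a model exploit informative missingness; imputation alone would erase the signal that a test was not ordered.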