Missing data

  • Article type: Journal Article
    The selection of auxiliary variables is an important first step in appropriately implementing missing data methods such as full information maximum likelihood (FIML) estimation or multiple imputation. However, practical guidelines and statistical tests for selecting useful auxiliary variables are somewhat lacking, leading to potentially biased estimates. We propose the use of random forest analysis and lasso regression as alternative methods to select auxiliary variables, particularly in situations in which the missing data pattern is nonlinear or otherwise complex (i.e., interactive relationships between variables and missingness). Monte Carlo simulations demonstrate the effectiveness of random forest analysis and lasso regression compared to traditional methods (t-tests, Little's MCAR test, logistic regressions), in terms of both selecting auxiliary variables and the performance of those auxiliary variables when incorporated into an analysis with missing data. Both techniques outperformed traditional methods, providing a promising direction for improving practical methods for handling missing data in statistical analyses.
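
    A minimal sketch of this selection strategy in R, assuming the randomForest and glmnet packages (the article provides no code; the data-generating step and variable names are illustrative placeholders): both methods are fit to the missingness indicator of the incomplete variable, and the candidates they flag are kept as auxiliary variables.

```r
library(randomForest)
library(glmnet)

set.seed(1)
# Illustrative data: y is the incomplete analysis variable, x1-x5 are candidate auxiliaries
n   <- 500
dat <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n),
                  x4 = rnorm(n), x5 = rnorm(n))
dat$y <- with(dat, x1 + 0.5 * x2 + rnorm(n))
dat$y[runif(n) < plogis(dat$x1 * dat$x3)] <- NA   # interactive (nonlinear) missingness

# The missingness indicator is the outcome that candidate auxiliaries must explain
miss <- factor(is.na(dat$y))
X    <- as.matrix(dat[, paste0("x", 1:5)])

# Random forest: rank candidates by permutation variable importance
rf <- randomForest(x = X, y = miss, importance = TRUE)
importance(rf, type = 1)

# Lasso-penalised logistic regression: nonzero coefficients flag useful auxiliaries
cv <- cv.glmnet(X, miss, family = "binomial", alpha = 1)
coef(cv, s = "lambda.min")
```

    Candidates flagged by either method would then be carried forward as auxiliary variables in the FIML or imputation model.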

  • Article type: Journal Article
    Analyzing longitudinal data in health studies is challenging due to sparse and error-prone measurements, strong within-individual correlation, missing data and various trajectory shapes. While mixed-effect models (MM) effectively address these challenges, they remain parametric models and may incur computational costs. In contrast, functional principal component analysis (FPCA) is a non-parametric approach developed for regular and dense functional data that flexibly describes temporal trajectories at a potentially lower computational cost. This article presents an empirical simulation study evaluating the behavior of FPCA with sparse and error-prone repeated measures and its robustness under different missing data schemes in comparison with MM. The results show that FPCA is well suited to data missing at random due to dropout, except in scenarios involving the most frequent and systematic dropout. Like MM, FPCA fails under a missing-not-at-random mechanism. FPCA was applied to describe the trajectories of four cognitive functions before clinical dementia and contrast them with those of matched controls in a case-control study nested in a population-based aging cohort. The average cognitive declines of future dementia cases showed a sudden divergence from those of their matched controls, with a sharp acceleration 5 to 2.5 years prior to diagnosis.
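
    A minimal sketch of sparse FPCA in R, assuming the fdapace package (the article does not name its software); the simulated visit times and error-prone measurements below are illustrative only.

```r
library(fdapace)

set.seed(1)
# Simulate sparse, error-prone trajectories for 100 subjects (illustrative only)
n  <- 100
Lt <- lapply(1:n, function(i) sort(runif(sample(3:6, 1), 0, 10)))      # irregular visit times
Ly <- lapply(Lt, function(t) sin(t / 3) + rnorm(length(t), sd = 0.3))  # noisy measurements

# Sparse FPCA (PACE): mean and eigenfunctions are estimated by pooling all subjects
fit <- FPCA(Ly, Lt, optns = list(dataType = "Sparse"))

fit$cumFVE    # cumulative fraction of variance explained by successive components
plot(fit)     # mean function, eigenfunctions and fitted trajectories
```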

  • Article type: Journal Article
    BACKGROUND: Early identification of children at high risk of developing myopia is essential to prevent myopia progression by introducing timely interventions. However, missing data and measurement error (ME) are common challenges in risk prediction modelling that can introduce bias in myopia prediction.
    METHODS: We explore four imputation methods to address missing data and ME: single imputation (SI), multiple imputation under missing at random (MI-MAR), multiple imputation with a calibration procedure (MI-ME), and multiple imputation under missing not at random (MI-MNAR). We compare four machine-learning models (Decision Tree, Naive Bayes, Random Forest, and XGBoost) and three statistical models (logistic regression, stepwise logistic regression, and least absolute shrinkage and selection operator logistic regression) in myopia risk prediction. We apply these models to the Shanghai Jinshan Myopia Cohort Study and also conduct a simulation study to investigate the impact of missing mechanisms, the degree of ME, and the importance of predictors on model performance. Model performance is evaluated using the area under the receiver operating characteristic curve (AUROC) and the area under the precision-recall curve (AUPRC).
    RESULTS: Our findings indicate that in scenarios with missing data and ME, using MI-ME in combination with logistic regression yields the best prediction results. In scenarios without ME, employing MI-MAR to handle missing data outperforms SI regardless of the missing mechanism. When ME has a greater impact on prediction than missing data, the relative advantage of MI-MAR diminishes and MI-ME becomes superior. Furthermore, our results demonstrate that statistical models exhibit better prediction performance than machine-learning models.
    CONCLUSIONS: MI-ME emerges as a reliable method for handling missing data and ME in important predictors for early-onset myopia risk prediction.
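
    A sketch of one arm of the comparison (MI under MAR followed by pooled logistic regression) in R with the mice package; the calibration procedure of MI-ME is not shown, and the toy cohort variables are illustrative assumptions rather than the Jinshan data.

```r
library(mice)

set.seed(1)
# Toy cohort: myopia indicator plus partly missing predictors (illustrative only)
n   <- 400
dat <- data.frame(age = rnorm(n, 9, 1.5),
                  ser = rnorm(n, -0.5, 1),     # spherical equivalent refraction
                  al  = rnorm(n, 23.5, 0.8))   # axial length
dat$myopia <- rbinom(n, 1, plogis(-1 - dat$ser + 0.3 * (dat$al - 23.5)))
dat$ser[sample(n, 80)] <- NA                   # missing predictor values

# MI under MAR (the MI-MAR arm of the comparison), then pooled logistic regression
imp <- mice(dat, m = 20, printFlag = FALSE)
fit <- with(imp, glm(myopia ~ age + ser + al, family = binomial))
summary(pool(fit))
```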

  • Article type: Journal Article
    Time series data are recorded in various sectors, resulting in large amounts of data. However, the continuity of these data is often interrupted, resulting in periods of missing data. Several algorithms are used to impute the missing data, and the performance of these methods varies widely. Apart from the choice of algorithm, effective imputation depends on the nature of the missing and available data. We conducted extensive studies using different types of time series data, specifically heart rate data and power consumption data. We generated missing data over different time spans and imputed them using different algorithms applied to binned data of different sizes. Performance was evaluated using the root mean square error (RMSE) metric. We observed a reduction in RMSE when using binned data compared to the entire dataset, particularly in the case of the expectation-maximization (EM) algorithm. RMSE was reduced when using binned data for 1-, 5-, and 15-min missing data, with a greater reduction observed for 15-min missing data. We also observed the effect of data fluctuation. We conclude that the usefulness of binned data depends on the span of the missing data, the sampling frequency of the data, and the fluctuation within the data. Depending on the inherent characteristics, quality, and quantity of the missing and available data, binned data can be used to impute a wide variety of data, including biological heart rate data derived from an Internet of Things (IoT) smartwatch and non-biological data such as household power consumption data.
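
    A sketch of the bin-versus-whole-series comparison in R. Amelia's bootstrapped-EM imputation stands in for the EM imputation evaluated in the article, and the synthetic heart-rate-like series, gap location, and bin width are illustrative assumptions.

```r
library(Amelia)   # bootstrapped-EM imputation; stands in for the article's EM step

set.seed(1)
# Two correlated 1-Hz signals (illustrative): heart rate and an activity proxy
n   <- 6 * 3600
act <- sin(seq_len(n) / 300) + rnorm(n, sd = 0.2)
hr  <- 70 + 5 * act + rnorm(n)
dat <- data.frame(hr = hr, act = act)

gap <- 10000:10299            # a 5-minute block of missing heart-rate data
dat$hr[gap] <- NA

rmse <- function(est, truth) sqrt(mean((est - truth)^2))

# Impute from the entire series vs. from a bin surrounding the gap
imp_full <- amelia(dat, m = 1, p2s = 0)$imputations[[1]]
bin      <- 8200:12099
imp_bin  <- amelia(dat[bin, ], m = 1, p2s = 0)$imputations[[1]]

c(full = rmse(imp_full$hr[gap], hr[gap]),
  bin  = rmse(imp_bin$hr[match(gap, bin)], hr[gap]))
```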

  • Article type: Journal Article
    BACKGROUND: Missing data are common in observational studies and often occur in several of the variables required when estimating a causal effect, i.e. the exposure, outcome and/or variables used to control for confounding. Analyses involving multiple incomplete variables are not as straightforward as analyses with a single incomplete variable. For example, in the context of multivariable missingness, the standard missing data assumptions ("missing completely at random", "missing at random" [MAR], "missing not at random") are difficult to interpret and assess. It is not clear how the complexities that arise due to multivariable missingness are being addressed in practice. The aim of this study was to review how missing data are managed and reported in observational studies that use multiple imputation (MI) for causal effect estimation, with a particular focus on missing data summaries, missing data assumptions, primary and sensitivity analyses, and MI implementation.
    METHODS: We searched five top general epidemiology journals for observational studies that aimed to answer a causal research question and used MI, published between January 2019 and December 2021. Article screening and data extraction were performed systematically.
    RESULTS: Of the 130 studies included in this review, 108 (83%) derived an analysis sample by excluding individuals with missing data in specific variables (e.g., outcome) and 114 (88%) had multivariable missingness within the analysis sample. Forty-four (34%) studies provided a statement about missing data assumptions, 35 of which stated the MAR assumption, but only 11/44 (25%) studies provided a justification for these assumptions. The number of imputations, MI method and MI software were generally well-reported (71%, 75% and 88% of studies, respectively), while aspects of the imputation model specification were not clear for more than half of the studies. A secondary analysis that used a different approach to handle the missing data was conducted in 69/130 (53%) studies. Of these 69 studies, 68 (99%) lacked a clear justification for the secondary analysis.
    CONCLUSIONS: Effort is needed to clarify the rationale for and improve the reporting of MI for estimation of causal effects from observational data. We encourage greater transparency in making and reporting analytical decisions related to missing data.

  • Article type: Journal Article
    Missing covariate data is a common problem that has not been addressed in observational studies of gene expression. Here, we present a multiple imputation method that accommodates high dimensional gene expression data by incorporating principal component analysis of the transcriptome into the multiple imputation prediction models to avoid bias. Simulation studies using three datasets show that this method outperforms complete case and single imputation analyses at uncovering true positive differentially expressed genes, limiting false discovery rates, and minimizing bias. This method is easily implemented via an R Bioconductor package, RNAseqCovarImpute, which integrates with the limma-voom pipeline for differential expression analysis.
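
    A sketch of the underlying idea in R using prcomp and mice; this is not the RNAseqCovarImpute interface itself, and the toy expression matrix, covariates, and choice of 10 principal components are illustrative assumptions.

```r
library(mice)

set.seed(1)
# Illustrative stand-in data: a counts-like expression matrix (genes x samples) and covariates
n_samp <- 60; n_gene <- 1000
expr   <- matrix(rpois(n_gene * n_samp, lambda = 50), nrow = n_gene)
covars <- data.frame(age = rnorm(n_samp, 40, 10), smoker = rbinom(n_samp, 1, 0.3))
covars$age[sample(n_samp, 12)] <- NA    # missing covariate values

# Core idea: summarise the transcriptome with principal components and feed the PCs
# into the imputation models for the incomplete covariates
pcs     <- prcomp(t(log2(expr + 1)))$x[, 1:10]
imp_dat <- cbind(covars, pcs)
imp     <- mice(imp_dat, m = 10, printFlag = FALSE)

# Each completed covariate set would then enter a limma-voom differential expression
# analysis, with results combined across imputations
head(complete(imp, 1))
```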

  • Article type: Journal Article
    Scarcity of stream salinity data poses a challenge to understanding salinity dynamics and its implications for water supply management in water-scarce salt-prone regions around the world. This paper introduces a framework for generating continuous daily stream salinity estimates using instance-based transfer learning (TL) and assessing the reliability of the synthetic salinity data through uncertainty quantification via prediction intervals (PIs). The framework was developed using two temporally distinct specific conductance (SC) datasets from the Upper Red River Basin (URRB) located in southwestern Oklahoma and the Texas Panhandle, United States. The instance-based TL approach was implemented by calibrating Feedforward Neural Networks (FFNNs) on a source SC dataset of around 1200 instantaneous grab samples collected by the United States Geological Survey (USGS) from 1959 to 1993. The trained FFNNs were subsequently tested on a target dataset (1998-present) of 220 instantaneous grab samples collected by the Oklahoma Water Resources Board (OWRB). The framework's generalizability was assessed in the data-rich Bird Creek watershed in Oklahoma by manipulating continuous SC data to simulate data-scarce conditions for training the models and using the complete Bird Creek dataset for model evaluation. The Lower Upper Bound Estimation (LUBE) method was used with FFNNs to estimate PIs for uncertainty quantification. Autoregressive SC prediction methods via FFNN were found to be reliable, with Nash-Sutcliffe Efficiency (NSE) values of 0.65 and 0.45 on in-sample and out-of-sample test data, respectively. The same modeling scenario resulted in an NSE of 0.54 for the Bird Creek data using a similar missing data ratio, whereas a higher ratio of observed data increased the accuracy (NSE = 0.84). The relatively narrow estimated PIs for the North Fork Red River in the URRB indicated satisfactory stream salinity predictions, showing an average width equivalent to 25% of the observed range at a confidence level of 70%.
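
    A schematic of instance-based TL and the NSE computation in R, assuming the nnet package; the data-generating function, predictors, and network size are placeholders rather than the article's URRB setup.

```r
library(nnet)

set.seed(1)
# Illustrative "source" (historical) and "target" (recent) specific-conductance datasets;
# predictors, sample sizes and the data-generating process are placeholders, not URRB data
make_data <- function(n, shift = 0) {
  q    <- runif(n)                                  # normalised streamflow proxy
  seas <- runif(n)                                  # fraction of the year
  sc   <- 800 + shift - 400 * q + 50 * sin(2 * pi * seas) + rnorm(n, sd = 40)
  data.frame(q = q, seas = seas, sc = sc)
}
source_dat <- make_data(1200)       # stands in for the 1959-1993 USGS grab samples
target_dat <- make_data(220, 60)    # stands in for the 1998-present OWRB samples

# Instance-based TL in its simplest form: calibrate the FFNN on source instances,
# then evaluate it on the target period
fit  <- nnet(sc ~ q + seas, data = source_dat, size = 8, linout = TRUE,
             maxit = 1000, trace = FALSE)
pred <- predict(fit, newdata = target_dat)

# Nash-Sutcliffe Efficiency on the target data
1 - sum((target_dat$sc - pred)^2) / sum((target_dat$sc - mean(target_dat$sc))^2)
```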

  • Article type: Journal Article
    When using multiple imputation, users often want to know how many imputations they need. An old answer is that 2-10 imputations usually suffice, but this recommendation only addresses the efficiency of point estimates. You may need more imputations if, in addition to efficient point estimates, you also want standard error (SE) estimates that would not change (much) if you imputed the data again. For replicable SE estimates, the required number of imputations increases quadratically with the fraction of missing information (not linearly, as previous studies have suggested). I recommend a two-stage procedure in which you conduct a pilot analysis using a small-to-moderate number of imputations, then use the results to calculate the number of imputations that are needed for a final analysis whose SE estimates will have the desired level of replicability. I implement the two-stage procedure using a new SAS macro called %mi_combine and a new Stata command called how_many_imputations.
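
    The two-stage logic can be scripted in any environment; a small R sketch follows. The quadratic-rule form used here, m ≈ 1 + ½(FMI/cv)², is an assumption made for illustration; the exact formula, and the adjustment for uncertainty in the pilot FMI implemented by %mi_combine and how_many_imputations, should be taken from the article itself.

```r
# Stage 1: run a pilot analysis with a small-to-moderate number of imputations and record
# the estimated fraction of missing information (FMI) for the parameter of interest.
# Stage 2: choose m so that the SE would replicate to within a target coefficient of
# variation (cv). A quadratic rule of the form m ~ 1 + 0.5 * (FMI / cv)^2 is assumed here.

imputations_needed <- function(fmi, cv = 0.05) {
  ceiling(1 + 0.5 * (fmi / cv)^2)
}

imputations_needed(fmi = 0.3)              # pilot FMI of 0.3, 5% target cv -> 19
imputations_needed(fmi = c(0.1, 0.3, 0.5)) # 3, 19, 51
```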

  • Article type: Journal Article
    Auxiliary variables are used in multiple imputation (MI) to reduce bias and increase efficiency. These variables may often themselves be incomplete. We explored how missing data in auxiliary variables influenced estimates obtained from MI. We implemented a simulation study with three different missing data mechanisms for the outcome. We then examined the impact of increasing proportions of missing data and different missingness mechanisms for the auxiliary variable on bias of an unadjusted linear regression coefficient and the fraction of missing information. We illustrate our findings with an applied example in the Avon Longitudinal Study of Parents and Children. We found that where complete records analyses were biased, increasing proportions of missing data in auxiliary variables, under any missing data mechanism, reduced the ability of MI including the auxiliary variable to mitigate this bias. Where there was no bias in the complete records analysis, inclusion of a missing not at random auxiliary variable in MI introduced bias of potentially important magnitude (up to 17% of the effect size in our simulation). Careful consideration of the quantity and nature of missing data in auxiliary variables needs to be made when selecting them for use in MI models.
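
    A toy version of the kind of simulation the article describes, assuming the mice package; the data-generating model, missingness rates, and the 0.8 target coefficient are illustrative assumptions, not the article's design.

```r
library(mice)

set.seed(1)
# One illustrative simulated dataset: x is the exposure, y the outcome, z an auxiliary variable
n <- 1000
x <- rnorm(n)
z <- 0.6 * x + rnorm(n)
y <- 0.5 * x + 0.5 * z + rnorm(n)
dat <- data.frame(x = x, y = y, z = z)

# Outcome missing at random given the auxiliary; the auxiliary itself is partly missing
dat$y[runif(n) < plogis(-1 + z)] <- NA
dat$z[runif(n) < 0.3] <- NA

imp <- mice(dat, m = 20, printFlag = FALSE)   # z enters the imputation model for y by default
fit <- with(imp, lm(y ~ x))
summary(pool(fit))   # compare the x coefficient with its true unadjusted value of 0.8
```

    Increasing the missingness in z, or making it missing not at random, is the kind of manipulation the study uses to probe how much the auxiliary variable can still reduce bias.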

  • Article type: Journal Article
    BACKGROUND: The Peter Clark (PC) algorithm is a popular causal discovery method to learn causal graphs in a data-driven way. Until recently, existing PC algorithm implementations in R had important limitations regarding missing values, temporal structure or mixed measurement scales (categorical/continuous), which are all common features of cohort data. The new R packages presented here, micd and tpc, fill these gaps.
    METHODS: micd and tpc are implemented as R packages.
    RESULTS: The micd package provides add-on functionality for dealing with missing values to the existing pcalg R package, including methods for multiple imputation that rely on the missing-at-random assumption. In addition, micd allows for mixed measurement scales assuming conditional Gaussianity. The tpc package efficiently exploits temporal information in a way that results in more informative output that is less prone to statistical errors.
    CONCLUSIONS: The tpc and micd packages are freely available on the Comprehensive R Archive Network (CRAN). Their source code is also available on GitHub (https://github.com/bips-hb/micd; https://github.com/bips-hb/tpc).
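
    The micd and tpc interfaces are not reproduced here, since their exact function signatures are not given in the abstract; the sketch below shows the standard pcalg PC-algorithm call on complete Gaussian data that both packages extend, using illustrative simulated variables.

```r
library(pcalg)

set.seed(1)
# Illustrative complete, continuous data; micd supplies analogous conditional-independence
# tests for multiply-imputed / mixed-scale data, and tpc additionally exploits temporal order
n  <- 500
v1 <- rnorm(n)
v2 <- 0.7 * v1 + rnorm(n)
v3 <- 0.5 * v2 + rnorm(n)
d  <- data.frame(v1, v2, v3)

suff <- list(C = cor(d), n = n)
fit  <- pc(suffStat = suff, indepTest = gaussCItest, alpha = 0.05,
           labels = colnames(d))
fit   # estimated CPDAG over v1, v2, v3
```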