missForest
  • Article type: Journal Article
    Many datasets in statistical analyses contain missing values. As omitting observations containing missing entries may lead to information loss or greatly reduce the sample size, imputation is usually preferable. However, imputation can also introduce bias and impact the quality and validity of subsequent analysis. Focusing on binary classification problems, we analyzed how missing value imputation under MCAR as well as MAR missingness with different missing patterns affects the predictive performance of subsequent classification. To this end, we compared imputation methods such as several MICE variants, missForest, Hot Deck as well as mean imputation with regard to the classification performance achieved with commonly used classifiers such as Random Forest, Extreme Gradient Boosting, Support Vector Machine and regularized logistic regression. Our simulation results showed that Random Forest based imputation (i.e., MICE Random Forest and missForest) performed particularly well in most scenarios studied. In addition to these two methods, simple mean imputation also proved to be useful, especially when many features (covariates) contained missing values.
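
The difference between MCAR and MAR missingness mentioned in this abstract can be illustrated with a minimal sketch. It is not the study's simulation design: the two-column synthetic data, the function names, and the probabilities 0.3, 0.1 and 0.5 are all assumptions chosen for illustration.

```python
import random

def make_mcar(rows, col, p, rng):
    """MCAR: blank out column `col` with the same probability p in every row."""
    return [
        [None if (j == col and rng.random() < p) else v for j, v in enumerate(row)]
        for row in rows
    ]

def make_mar(rows, col, driver, threshold, p_low, p_high, rng):
    """MAR: blank out column `col` more often when the fully observed
    `driver` column exceeds `threshold`."""
    out = []
    for row in rows:
        p = p_high if row[driver] > threshold else p_low
        out.append([None if (j == col and rng.random() < p) else v
                    for j, v in enumerate(row)])
    return out

def missing_rate(rows, col, keep):
    """Share of missing entries in `col` among the rows selected by `keep`."""
    picked = [r for r in rows if keep(r)]
    return sum(r[col] is None for r in picked) / len(picked)

rng = random.Random(0)
# Synthetic data: two independent standard-normal features (illustrative only).
data = [[rng.gauss(0, 1), rng.gauss(0, 1)] for _ in range(2000)]
mcar = make_mcar(data, col=1, p=0.3, rng=rng)
mar = make_mar(data, col=1, driver=0, threshold=0.0,
               p_low=0.1, p_high=0.5, rng=rng)

# Under MAR the missing rate of column 1 differs between the two groups
# defined by column 0; under MCAR it does not depend on column 0.
rate_high = missing_rate(mar, 1, lambda r: r[0] > 0.0)
rate_low = missing_rate(mar, 1, lambda r: r[0] <= 0.0)
```

Because the driver column is never masked, the missingness in column 1 depends only on observed values, which is what distinguishes MAR from MNAR.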

  • Article type: Journal Article
    Hourly traffic volumes, collected by automatic traffic recorders (ATRs), are of paramount importance since they are used to calculate average annual daily traffic (AADT) and design hourly volume (DHV). Hence, it is necessary to ensure the quality of the collected data. Unfortunately, ATRs malfunction occasionally, resulting in missing data, as well as unreliable counts. This naturally has an impact on the accuracy of the key parameters derived from the hourly counts. This study aims to solve this problem. ATR data from New South Wales, Australia was screened for irregularities and invalid entries. A total of 25% of the reliable data was randomly selected to test thirteen different imputation methods. Two scenarios for data omission, i.e., 25% and 100%, were analyzed. Results indicated that missForest outperformed other imputation methods; hence, it was used to impute the actual missing data to complete the dataset. AADT values were calculated from both original counts before imputation and completed counts after imputation. AADT values from imputed data were slightly higher. The average daily volumes when plotted validated the quality of imputed data, as the annual trends demonstrated a relatively better fit.
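
The evaluation idea in this abstract (hide a share of reliable counts, impute them, and compare against the held-back truth) can be sketched as follows. This is a loose illustration, not the study's protocol: the synthetic counts, the 28-day window, and the hour-of-day mean imputer are assumptions, and the imputer is a simple stand-in rather than missForest.

```python
import math
import random

def mask_known(values, fraction, rng):
    """Hide a random `fraction` of the non-missing entries.
    Returns the masked copy and the list of hidden indices."""
    known = [i for i, v in enumerate(values) if v is not None]
    hidden = rng.sample(known, int(len(known) * fraction))
    masked = list(values)
    for i in hidden:
        masked[i] = None
    return masked, hidden

def impute_hour_mean(values):
    """Stand-in imputer (NOT missForest): fill each gap with the mean of the
    observed counts at the same hour of day, falling back to the overall mean."""
    observed = [v for v in values if v is not None]
    overall = sum(observed) / len(observed)
    filled = list(values)
    for hour in range(24):
        same_hour = [v for v in values[hour::24] if v is not None]
        mean = sum(same_hour) / len(same_hour) if same_hour else overall
        for i in range(hour, len(values), 24):
            if filled[i] is None:
                filled[i] = mean
    return filled

def rmse(truth, estimate, indices):
    """Root-mean-square error over the hidden positions only."""
    return math.sqrt(sum((truth[i] - estimate[i]) ** 2 for i in indices) / len(indices))

def average_daily_traffic(hourly):
    """Mean of daily totals; over a full year of data this is the AADT."""
    days = [sum(hourly[d:d + 24]) for d in range(0, len(hourly), 24)]
    return sum(days) / len(days)

rng = random.Random(1)
# Synthetic hourly counts for 28 days: a daily cycle plus noise (illustrative only).
counts = [200 + 150 * math.sin(2 * math.pi * (h % 24) / 24) + rng.gauss(0, 10)
          for h in range(24 * 28)]
masked, hidden = mask_known(counts, 0.25, rng)
completed = impute_hour_mean(masked)
error = rmse(counts, completed, hidden)
```

Swapping `impute_hour_mean` for another imputer while keeping `mask_known` and `rmse` fixed gives a like-for-like comparison of methods on the same hidden entries.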

  • Article type: Journal Article
    Missing data are a common problem in large-scale datasets, and their appropriate handling is crucial for data analysis. Missingness can be categorized as (1) missing completely at random (MCAR), (2) missing at random (MAR), and (3) missing not at random (MNAR). Different missingness mechanisms require different imputation strategies. Multiple imputation, an approach that averages outcomes across multiple imputed datasets, is more suitable than single imputation for dealing with various missingness mechanisms. missForest, a nonparametric missing value imputation strategy using random forest, is one of the most prevalent multiple imputation methods for missing data because it can be applied to mixed-type data and does not require distributional assumptions. However, a recent study found that missForest can produce biased results for non-normal data. In addition, missForest is computationally expensive.
    Therefore, we aimed to further develop the missForest algorithm by combining it with a binary particle swarm optimization (BPSO)-based feature-selection strategy.
    The BPSO is an evolutionary algorithm that is well known for global optimization and computational efficiency. By using the BPSO-based feature selection step prior to imputing missing values with missForest, the imputation accuracy for continuous variables could be increased by pruning redundant variables.
    In this study, missForest with BPSO (BPSOmf), which performs feature selection prior to the imputation step, showed better imputation accuracy for continuous variables than missForest alone.
    BPSOmf is an appropriate and robust method when the imputation target data consist mainly of continuous variables.

  • Article type: Journal Article
    Multivariate calibration based on partial least squares, random forest, and support vector machine methods, combined with the MissForest imputation algorithm, was used to understand the interaction between ozone and nitrogen oxides, carbon monoxide, wind speed, solar radiation, temperature, relative humidity, and other variables. The data were collected by air quality monitoring stations at four distinct sites in the metropolitan area of Rio de Janeiro between 2014 and 2018. These techniques provide an easy and feasible way of modeling and analyzing air pollutants and can be used when coupled with other methods. The results showed that random forest and support vector machine chemometric techniques can be used in modeling and predicting tropospheric ozone concentrations, with a coefficient of determination for prediction of up to 0.92, a root-mean-square error of calibration between 4.66 and 27.15 µg m⁻³, and a root-mean-square error of prediction between 4.17 and 22.45 µg m⁻³, depending on the air quality monitoring station and season.
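
The two quality measures quoted in this abstract, the root-mean-square error and the coefficient of determination, can be computed as in the sketch below. The function names and the five observed and predicted values are made up for illustration; they are not data from the study.

```python
import math

def rmse(observed, predicted):
    """Root-mean-square error between paired observations and predictions."""
    n = len(observed)
    return math.sqrt(sum((o - p) ** 2 for o, p in zip(observed, predicted)) / n)

def r_squared(observed, predicted):
    """Coefficient of determination: 1 - residual sum of squares / total sum of squares."""
    mean_obs = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    return 1.0 - ss_res / ss_tot

# Made-up ozone concentrations in µg m⁻³ (illustrative only).
observed = [40.0, 55.0, 62.0, 48.0, 70.0]
predicted = [42.0, 53.0, 60.0, 50.0, 68.0]

print(rmse(observed, predicted))       # every residual is ±2, so the RMSE is 2.0
print(r_squared(observed, predicted))  # 1 - 20/548
```

Applied to a calibration set, `rmse` gives the error of calibration; applied to held-out data, it gives the error of prediction.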

  • Article type: Journal Article
    Neuropsychological scores and the Functional Activities Questionnaire (FAQ) are important for measuring the cognitive and functional domains of patients affected by Alzheimer's disease. Further, standardized datasets curated from several centers across the globe are available today and aid the development of computer-aided diagnosis tools. However, numerous clinical tests are used to measure these scores, which makes their assessment in diagnosis a challenging task. The datasets also suffer from the common problems of missing and imbalanced data. In this paper, we propose a machine learning based framework to overcome these issues. Empirical results demonstrate that the Genetic Algorithm performs better on the neuropsychological scores after missForest imputation and on the FAQ scores after applying the Synthetic Minority Oversampling Technique.