Keywords: High dimensional data; Imputation; MAR; MCAR; MNAR; Metabolomics; Missing values; RF

MeSH: Bias; Chromatography, Liquid; Humans; Mass Spectrometry / methods, statistics & numerical data; Metabolomics / methods, statistics & numerical data

Source: DOI:10.1186/s12859-019-3110-0

Abstract:
BACKGROUND: LC-MS technology makes it possible to measure the relative abundance of numerous molecular features of a sample in a single analysis. However, non-targeted metabolite profiling approaches in particular generate vast arrays of data that are prone to aberrations such as missing values. Whatever the reason for the missing values, a coherent and complete data matrix is always a prerequisite for accurate and reliable statistical analysis. Therefore, there is a need for proper imputation strategies that account for the missingness and reduce the bias in the statistical analysis.
RESULTS: Here we present the results of evaluating nine imputation methods at four different percentages of missing values of different origins. The performance of each imputation method was assessed by Normalized Root Mean Squared Error (NRMSE). Random forest (RF) had the lowest NRMSE when estimating values that were Missing at Random (MAR) or Missing Completely at Random (MCAR). For values Missing Not at Random (MNAR), the left-truncated data were best imputed with minimum value imputation. We also tested the imputation methods on datasets containing missing data of mixed origin, and RF was the most accurate method in all of those cases. The results were obtained by repeating the evaluation process 100 times on metabolomics datasets into which missing values were introduced to represent absent data of different origins.
CONCLUSIONS: The type and rate of missingness affect the performance and suitability of imputation methods. RF-based imputation performs best in most of the tested scenarios, including combinations of different types and rates of missingness. Therefore, we recommend random forest-based imputation for missing metabolomics data, especially in situations where the types of missingness are not known in advance.
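
The evaluation loop described in the abstract (remove known values, impute them, score the imputation by NRMSE) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration and not the authors' code: the simulated feature, the 30% missingness rate, the helper names, and the NRMSE definition used (RMSE divided by the standard deviation of the removed true values). Random forest imputation is not implemented; mean imputation stands in as a simple baseline next to minimum value imputation.

```python
import math
import random
import statistics


def nrmse(true_vals, imputed_vals):
    """RMSE divided by the standard deviation of the true values (assumed definition)."""
    squared_errors = [(t - i) ** 2 for t, i in zip(true_vals, imputed_vals)]
    return math.sqrt(statistics.fmean(squared_errors) / statistics.pvariance(true_vals))


def mask_mcar(n, rate, rng):
    """Indices removed completely at random (MCAR)."""
    return set(rng.sample(range(n), round(n * rate)))


def mask_mnar(values, rate):
    """Indices of the lowest values, mimicking left truncation below a detection limit (MNAR)."""
    order = sorted(range(len(values)), key=values.__getitem__)
    return set(order[: round(len(values) * rate)])


def evaluate(values, missing, fill):
    """Impute every removed entry with fill(observed values) and score it by NRMSE."""
    observed = [v for i, v in enumerate(values) if i not in missing]
    truth = [values[i] for i in sorted(missing)]
    constant = fill(observed)
    return nrmse(truth, [constant] * len(truth))


rng = random.Random(0)
# Hypothetical log-scale intensities of one metabolite feature across 200 samples.
feature = [rng.gauss(10.0, 1.0) for _ in range(200)]

scenarios = [
    ("MCAR", mask_mcar(len(feature), 0.3, rng)),
    ("MNAR", mask_mnar(feature, 0.3)),
]
for label, missing in scenarios:
    print(
        label,
        "min:", round(evaluate(feature, missing, min), 3),
        "mean:", round(evaluate(feature, missing, statistics.fmean), 3),
    )
```

Under the left-truncated MNAR mask every removed value lies below the smallest observed value, so the observed minimum is necessarily closer to each of them than the observed mean is. This is consistent with the abstract's finding that minimum value imputation suits left-truncated data. A fuller comparison would repeat the masking many times, as the study does with 100 repetitions, and would include a multivariate method such as random forest imputation.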