Keywords: Clinical data; Imputation; Machine learning; Multivariate

MeSH: Autistic Disorder / genetics; Bayes Theorem; Child; Data Collection / methods; Humans

Source: DOI: 10.1186/s12874-022-01656-z

Abstract:
An increasing number of large-scale multi-modal research initiatives have been conducted in the typically developing population (e.g. Dev. Cogn. Neur. 32:43-54, 2018; PLoS Med. 12(3):e1001779, 2015; Elam and Van Essen, Enc. Comp. Neur., 2013) as well as in psychiatric cohorts (e.g. Trans. Psych. 10(1):100, 2020; Mol. Psych. 19:659-667, 2014; Mol. Aut. 8:24, 2017; Eur. Child and Adol. Psych. 24(3):265-281, 2015). Missing data are a common problem in such datasets because it is difficult to administer multiple measures to a large number of participants. The consequences of missing data accumulate when researchers aim to integrate relationships across multiple measures. Here we evaluate different imputation strategies for filling in missing values in clinical data from a large (total N = 764) and deeply phenotyped (i.e. assessed with a range of clinical and cognitive instruments) sample of N = 453 autistic individuals and N = 311 control individuals recruited as part of the EU-AIMS Longitudinal European Autism Project (LEAP) consortium. In particular, we consider a total of 160 clinical measures divided into 15 overlapping subsets of participants. We use two simple but common univariate strategies, mean and median imputation, as well as a Round Robin regression approach involving four independent multivariate regression models, including Bayesian Ridge regression and several non-linear models: Decision Trees, Extra Trees, and Nearest Neighbours regression. We evaluate the models using the traditional mean squared error against observed values that were deliberately removed, and also consider the Kullback-Leibler divergence between the observed and the imputed distributions. We show that all of the multivariate approaches tested provide a substantial improvement over typical univariate approaches. Further, our analyses reveal that, across all 15 data subsets tested, an Extra Trees regression approach provided the best global results. This not only allows the selection of a single model to impute missing data for the LEAP project and delivers a fixed set of imputed clinical data for researchers working with the LEAP dataset in the future, but also provides more general guidelines for data imputation in large-scale epidemiological studies.
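The evaluation idea described in the abstract (remove values that were actually observed, impute them, then score the result) can be illustrated with a minimal sketch. It covers only the two univariate baselines, mean and median imputation, and does not reproduce the multivariate Round Robin models or the authors' pipeline. The data are synthetic, every function name is hypothetical, and the histogram-based Kullback-Leibler estimate is an assumption made for illustration, not the estimator used in the paper.

```python
import math
import random
import statistics


def mask_values(values, fraction, rng):
    """Hide a random fraction of observed values; return the masked list and hidden indices."""
    n_hide = max(1, int(len(values) * fraction))
    hidden = sorted(rng.sample(range(len(values)), n_hide))
    masked = list(values)
    for i in hidden:
        masked[i] = None
    return masked, hidden


def impute_constant(masked, strategy):
    """Fill every missing entry with the mean or median of the remaining observed entries."""
    observed = [v for v in masked if v is not None]
    fill = statistics.mean(observed) if strategy == "mean" else statistics.median(observed)
    return [fill if v is None else v for v in masked]


def mse_on_hidden(truth, imputed, hidden):
    """Mean squared error computed only on the entries that were deliberately removed."""
    return sum((truth[i] - imputed[i]) ** 2 for i in hidden) / len(hidden)


def kl_divergence(p_samples, q_samples, bins=10, eps=1e-9):
    """Histogram estimate of KL(P || Q) over a shared set of equal-width bins."""
    lo = min(min(p_samples), min(q_samples))
    hi = max(max(p_samples), max(q_samples))
    width = (hi - lo) / bins or 1.0  # guard against a zero-width range

    def hist(samples):
        counts = [0] * bins
        for s in samples:
            counts[min(int((s - lo) / width), bins - 1)] += 1
        smoothed = [c / len(samples) + eps for c in counts]  # avoid log(0)
        total = sum(smoothed)
        return [x / total for x in smoothed]

    p, q = hist(p_samples), hist(q_samples)
    return sum(a * math.log(a / b) for a, b in zip(p, q))


# Synthetic stand-in for one clinical measure (assumption: roughly Gaussian scores).
rng = random.Random(0)
scores = [rng.gauss(50, 10) for _ in range(200)]
masked, hidden = mask_values(scores, 0.2, rng)

results = {}
for strategy in ("mean", "median"):
    imputed = impute_constant(masked, strategy)
    results[strategy] = (
        mse_on_hidden(scores, imputed, hidden),
        kl_divergence(scores, imputed),
    )

for strategy, (mse, kl) in results.items():
    print(f"{strategy}: MSE={mse:.2f}, KL={kl:.4f}")
```

Constant imputation places every filled value in a single histogram bin, so the divergence term penalises it even when the squared error looks acceptable. This is one reason a distribution-level score can be informative alongside the mean squared error.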