Covariate shift

  • Article type: Journal Article
    In many modern machine learning applications, changes in covariate distributions and difficulty in acquiring outcome information have posed challenges to robust model training and evaluation. Numerous transfer learning methods have been developed to robustly adapt the model itself to some unlabeled target populations using existing labeled data in a source population. However, there is a paucity of literature on transferring performance metrics, especially receiver operating characteristic (ROC) parameters, of a trained model. In this paper, we aim to evaluate the performance of a trained binary classifier on an unlabeled target population based on ROC analysis. We propose Semisupervised Transfer lEarning of Accuracy Measures (STEAM), an efficient three-step estimation procedure that employs (1) double-index modeling to construct calibrated density ratio weights and (2) robust imputation to leverage the large amount of unlabeled data to improve estimation efficiency. We establish the consistency and asymptotic normality of the proposed estimator under the correct specification of either the density ratio model or the outcome model. We also correct for potential overfitting bias in the estimators in finite samples with cross-validation. We compare our proposed estimators to existing methods and show reductions in bias and gains in efficiency through simulations. We illustrate the practical utility of the proposed method in evaluating the prediction performance of a phenotyping model for rheumatoid arthritis (RA) on a temporally evolving EHR cohort.
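    The density-ratio weighting in step (1) can be illustrated with a small sketch. The code below is not the authors' STEAM procedure; it simply fits a source-vs-target logistic model to obtain calibrated density-ratio weights and uses them in a weighted empirical AUC on the labeled source data (function and variable names are illustrative).

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def density_ratio_weights(X_source, X_target):
    """Estimate w(x) ~ p_target(x) / p_source(x) with a source-vs-target classifier."""
    X = np.vstack([X_source, X_target])
    z = np.r_[np.zeros(len(X_source)), np.ones(len(X_target))]   # 1 = target
    clf = LogisticRegression(max_iter=1000).fit(X, z)
    p = clf.predict_proba(X_source)[:, 1]
    w = (p / (1 - p)) * (len(X_source) / len(X_target))          # odds, rescaled
    return w / w.mean()                                          # calibrate to mean 1

def weighted_auc(scores, y, w):
    """Weighted probability that a positive case is ranked above a negative case."""
    pos, neg = y == 1, y == 0
    s_p, w_p, s_n, w_n = scores[pos], w[pos], scores[neg], w[neg]
    pair_w = np.outer(w_p, w_n)
    wins = (s_p[:, None] > s_n[None, :]) + 0.5 * (s_p[:, None] == s_n[None, :])
    return (pair_w * wins).sum() / pair_w.sum()

# weighted_auc(scores, y, density_ratio_weights(X_source, X_target)) then estimates
# the classifier's AUC under the target covariate distribution.
```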

  • Article type: Journal Article
    Predicting sets of outcomes, instead of unique outcomes, is a promising solution to uncertainty quantification in statistical learning. Despite a rich literature on constructing prediction sets with statistical guarantees, adapting to unknown covariate shift, a prevalent issue in practice, poses a serious unsolved challenge. In this article, we show that prediction sets with a finite-sample coverage guarantee are uninformative and propose a novel, flexible, distribution-free method, PredSet-1Step, to efficiently construct prediction sets with an asymptotic coverage guarantee under unknown covariate shift. We formally show that our method is asymptotically probably approximately correct, having well-calibrated coverage error with high confidence for large samples. We illustrate that it achieves nominal coverage in a number of experiments and on a dataset concerning HIV risk prediction in a South African cohort study. Our theory hinges on a new bound for the convergence rate of the coverage of Wald confidence intervals based on general asymptotically linear estimators.
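    PredSet-1Step itself is not reproduced here; as a point of comparison, the sketch below shows the related density-ratio-weighted split-conformal construction (Tibshirani et al., 2019), which also builds prediction sets under covariate shift. All names and the choice of absolute-residual scores are illustrative assumptions.

```python
import numpy as np

def weighted_conformal_interval(resid_cal, w_cal, w_test, y_hat_test, alpha=0.1):
    """One prediction interval from weighted split conformal under covariate shift.

    resid_cal : |y - y_hat| on a held-out calibration set drawn from the source
    w_cal     : density-ratio weights w(x_i) for the calibration points
    w_test    : density-ratio weight w(x_test) for the test covariates
    """
    order = np.argsort(resid_cal)
    r, w = np.asarray(resid_cal)[order], np.asarray(w_cal)[order]
    p = np.append(w, w_test)
    p = p / p.sum()                        # normalized weights; the test point carries p[-1]
    cum = np.cumsum(p[:-1])
    if cum[-1] >= 1 - alpha:
        q = r[np.searchsorted(cum, 1 - alpha)]   # weighted (1 - alpha) quantile of residuals
    else:
        q = np.inf                               # not enough calibration mass: trivial interval
    return y_hat_test - q, y_hat_test + q
```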

  • Article type: Review
    BACKGROUND: Concept drift and covariate shift lead to a degradation of machine learning (ML) models. The objective of our study was to characterize sudden data drift as caused by the COVID pandemic. Furthermore, we investigated the suitability of certain methods in model training to prevent model degradation caused by data drift.
    METHODS: We trained different ML models with the H2O AutoML method on a dataset comprising 102,666 cases of surgical patients collected in the years 2014-2019 to predict postoperative mortality using preoperatively available data. The models applied were a Generalized Linear Model with regularization, Default Random Forest, Gradient Boosting Machine, eXtreme Gradient Boosting, Deep Learning, and Stacked Ensembles comprising all base models. Further, we modified the original models by applying three different methods when training on the original pre-pandemic dataset: (1) weighting older data less strongly (Rahmani K, et al, Int J Med Inform 173:104930, 2023), (2) using only the most recent data for model training (Morger A, et al, Sci Rep 12:7244, 2022), and (3) performing a z-transformation of the numerical input parameters (Dilmegani C, 2023). Afterwards, we tested model performance on a pre-pandemic and an in-pandemic dataset not used in the training process, and analysed common features.
    RESULTS: The models produced showed excellent areas under the receiver-operating characteristic curve and acceptable areas under the precision-recall curve when tested on a dataset from January-March 2020, but significant degradation when tested on a dataset collected in the first wave of the COVID pandemic from April-May 2020. When comparing the probability distributions of the input parameters, significant differences between pre-pandemic and in-pandemic data were found. The endpoint of our models, in-hospital mortality after surgery, did not differ significantly between pre- and in-pandemic data and was about 1% in each case. However, the models varied considerably in the composition of their input parameters. None of our applied modifications prevented a loss of performance, although very different models, using a large variety of parameters, emerged from them.
    CONCLUSIONS: Our results show that none of the easy-to-implement measures we tested in model training can prevent deterioration in the case of sudden external events. Therefore, we conclude that, in the presence of concept drift and covariate shift, close monitoring and critical review of model predictions are necessary.
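    The covariate comparison described in the results can be scripted as a routine drift check. The sketch below runs a Kolmogorov-Smirnov test per numeric input between a pre-pandemic and an in-pandemic data frame and ranks the features by shift; column names and the significance threshold are illustrative assumptions, not the study's pipeline.

```python
import numpy as np
import pandas as pd
from scipy.stats import ks_2samp

def covariate_shift_report(df_pre: pd.DataFrame, df_post: pd.DataFrame, alpha: float = 0.01):
    """Two-sample KS test per numeric column between two time periods."""
    rows = []
    for col in df_pre.select_dtypes(include=np.number).columns:
        stat, p = ks_2samp(df_pre[col].dropna(), df_post[col].dropna())
        rows.append({"feature": col, "ks_stat": stat, "p_value": p, "shifted": p < alpha})
    return pd.DataFrame(rows).sort_values("ks_stat", ascending=False)

# Example: covariate_shift_report(data_2014_2019, data_apr_may_2020) flags inputs such as
# age or ASA score whose distributions changed with the onset of the pandemic.
```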

  • Article type: Journal Article
    The excellent performance of deep neural networks on image classification tasks depends on large-scale, high-quality datasets. However, datasets collected from the real world are typically biased in their distribution, which leads to a sharp decline in model performance, mainly because an imbalanced distribution results in prior shift and covariate shift. Recent studies have typically used a two-stage learning method consisting of two rebalancing strategies to solve these problems, but combining partial rebalancing strategies damages the representational ability of the networks. In addition, the two-stage learning method is of little help in addressing covariate shift. To solve these two issues, we first propose a sample logit-aware reweighting method (SLA), which not only repairs the weights of majority-class hard samples and minority-class samples but also integrates with logit adjustment to form a stable two-stage learning strategy. Second, to address covariate shift, inspired by ensemble learning, we propose a multi-domain expert specialization model, which achieves a more comprehensive decision by averaging expert classification results from multiple different domains. Finally, we combine SLA and logit adjustment into a two-stage learning method and apply our model to the CIFAR-LT and ImageNet-LT datasets. Our experimental results show excellent performance compared with state-of-the-art methods.
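    The logit-adjustment component mentioned above is a standard remedy for prior (label) shift in long-tailed classification: the log class priors are subtracted from the logits before prediction. A minimal sketch follows; it is illustrative and is not the paper's SLA reweighting.

```python
import numpy as np

def logit_adjusted_predict(logits, class_counts, tau=1.0):
    """Subtract tau * log(prior) from the logits to counteract a long-tailed training prior."""
    priors = np.asarray(class_counts, dtype=float) / np.sum(class_counts)
    adjusted = logits - tau * np.log(priors)    # demotes head classes, promotes tail classes
    return adjusted.argmax(axis=1)
```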

  • Article type: Journal Article
    Individualized treatment effect lies at the heart of precision medicine. Interpretable individualized treatment rules (ITRs) are desirable for clinicians or policymakers due to their intuitive appeal and transparency. The gold-standard approach to estimating ITRs is randomized experiments, where subjects are randomized to different treatment groups and confounding bias is minimized to the extent possible. However, experimental studies are limited in external validity because of their selection restrictions, and therefore the underlying study population is not representative of the target real-world population. Conventional methods that learn optimal interpretable ITRs for a target population from experimental data alone are biased. On the other hand, real-world data (RWD) are becoming popular and provide a representative sample of the target population. To learn a generalizable optimal interpretable ITR, we propose an integrative transfer learning method based on weighting schemes to calibrate the covariate distribution of the experiment to that of the RWD. Theoretically, we establish risk consistency for the proposed ITR estimator. Empirically, we evaluate the finite-sample performance of the transfer learner through simulations and apply it to real data from a job training program.
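    The calibration-weighting idea, reweighting trial subjects so that their covariate distribution matches the RWD, can be sketched with entropy-balancing weights followed by a weighted value estimate of a candidate rule. This is an illustrative sketch under assumed inputs, not the authors' estimator.

```python
import numpy as np
from scipy.optimize import minimize

def entropy_balancing_weights(X_trial, target_means):
    """Weights on trial subjects whose weighted covariate means match the RWD means."""
    C = X_trial - target_means                            # moment discrepancies
    dual = lambda lam: np.log(np.exp(-C @ lam).sum())     # convex dual of the entropy objective
    lam = minimize(dual, np.zeros(C.shape[1]), method="BFGS").x
    w = np.exp(-C @ lam)
    return w / w.sum()

def weighted_rule_value(y, a, d, pi_a, w):
    """Calibration-weighted IPW value of a candidate rule d, evaluated on the trial data."""
    match = (a == d).astype(float)
    return np.sum(w * match * y / pi_a) / np.sum(w * match / pi_a)
```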

  • Article type: Journal Article
    Machine-learning models are susceptible to external influences which can result in performance deterioration. The aim of our study was to elucidate the impact of a sudden shift in covariates, like the one caused by the Covid-19 pandemic, on model performance.
    After ethical approval and registration at ClinicalTrials.gov (NCT04092933, initial release 17/09/2019), we developed different models for the prediction of perioperative mortality based on preoperative data: one for the pre-pandemic data period until March 2020, one including data from before the pandemic and from the first wave until May 2020, and one covering the complete period before and during the pandemic until October 2021. We applied XGBoost as well as a Deep Learning neural network (DL). Performance metrics of each model during the different pandemic phases were determined, and the XGBoost models were analysed for changes in feature importance.
    XGBoost and DL provided similar performance on the pre-pandemic data with respect to the area under the receiver operating characteristic curve (AUROC, 0.951 vs. 0.942) and the area under the precision-recall curve (AUPR, 0.144 vs. 0.187). Validation in patient cohorts of the different pandemic waves showed high fluctuations in both AUROC and AUPR for DL, whereas the XGBoost models seemed more stable. Changes in variable frequencies with the onset of the pandemic were visible in age, ASA score, and a higher proportion of emergency operations, among others. Age consistently showed the highest information gain. Models based on pre-pandemic data performed worse during the first pandemic wave (AUROC 0.914 for XGBoost and DL), whereas models augmented with data from the first wave lacked performance after the first wave (AUROC 0.907 for XGBoost and 0.747 for DL). The deterioration was also visible in AUPR, which worsened by over 50% for both XGBoost and DL in the first phase after re-training.
    A sudden shift in data impacts model performance. Re-training the model with updated data may reduce predictive accuracy if the changes are only transient. Premature re-training should therefore be avoided, and close model surveillance is necessary.
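    The close surveillance recommended here amounts to evaluating a frozen model on successive time slices. The sketch below assumes an sklearn-style classifier and a data frame with a period column; all names are illustrative.

```python
import pandas as pd
from sklearn.metrics import roc_auc_score, average_precision_score

def monitor_by_period(model, df: pd.DataFrame, feature_cols, label_col, period_col):
    """AUROC and AUPR of a trained model, computed separately for each time period."""
    rows = []
    for period, chunk in df.groupby(period_col):
        scores = model.predict_proba(chunk[feature_cols])[:, 1]
        rows.append({"period": period, "n": len(chunk),
                     "auroc": roc_auc_score(chunk[label_col], scores),
                     "aupr": average_precision_score(chunk[label_col], scores)})
    return pd.DataFrame(rows).sort_values("period")
```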

  • Article type: Meta-Analysis
    Causally interpretable meta-analysis combines information from a collection of randomized controlled trials to estimate treatment effects in a target population in which experimentation may not be possible but from which covariate information can be obtained. In such analyses, a key practical challenge is the presence of systematically missing data when some trials have collected data on one or more baseline covariates, but other trials have not, such that the covariate information is missing for all participants in the latter. In this article, we provide identification results for potential (counterfactual) outcome means and average treatment effects in the target population when covariate data are systematically missing from some of the trials in the meta-analysis. We propose three estimators for the average treatment effect in the target population, examine their asymptotic properties, and show that they have good finite-sample performance in simulation studies. We use the estimators to analyze data from two large lung cancer screening trials and target population data from the National Health and Nutrition Examination Survey (NHANES). To accommodate the complex survey design of the NHANES, we modify the methods to incorporate survey sampling weights and allow for clustering.
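    For orientation, the transportability identification that such estimators build on standardizes the trial outcome regression to the target covariate distribution; a generic form is displayed below (the article's estimators additionally handle systematically missing covariates, survey weights, and clustering, which this display omits).

```latex
% S = 1: randomized trial, S = 0: target population, A: treatment, X: baseline covariates.
% Under conditional exchangeability over S and positivity,
\mathbb{E}\left[ Y^{a} \mid S = 0 \right]
  \;=\;
\mathbb{E}\!\left\{\, \mathbb{E}\left[ Y \mid X, S = 1, A = a \right] \;\middle|\; S = 0 \,\right\},
\qquad
\text{ATE}_{\text{target}} \;=\; \mathbb{E}\left[ Y^{1} - Y^{0} \mid S = 0 \right].
```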

  • Article type: Journal Article
    We propose methods for estimating the area under the receiver operating characteristic (ROC) curve (AUC) of a prediction model in a target population that differs from the source population that provided the data used for original model development. If covariates that are associated with model performance, as measured by the AUC, have a different distribution in the source and target populations, then AUC estimators that only use data from the source population will not reflect model performance in the target population. Here, we provide identification results for the AUC in the target population when outcome and covariate data are available from the sample of the source population, but only covariate data are available from the sample of the target population. In this setting, we propose three estimators for the AUC in the target population and show that they are consistent and asymptotically normal. We evaluate the finite-sample performance of the estimators using simulations and use them to estimate the AUC in a nationally representative target population from the National Health and Nutrition Examination Survey for a lung cancer risk prediction model developed using source population data from the National Lung Screening Trial.
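    As a generic point of reference (not necessarily one of the article's three estimators), an inverse-odds-weighted AUC reweights source case-control pairs by density-ratio weights so that the estimate reflects the target covariate distribution:

```latex
\widehat{\mathrm{AUC}}_{\text{target}}
  \;=\;
\frac{\sum_{i:\,Y_i=1}\sum_{j:\,Y_j=0} w(X_i)\, w(X_j)\,
      \bigl[\mathbf{1}\{S_i > S_j\} + \tfrac{1}{2}\,\mathbf{1}\{S_i = S_j\}\bigr]}
     {\sum_{i:\,Y_i=1}\sum_{j:\,Y_j=0} w(X_i)\, w(X_j)},
\qquad
w(x) \approx \frac{p_{\text{target}}(x)}{p_{\text{source}}(x)},
```

    where S_i denotes the model's predicted score for subject i in the source sample and w(x) is estimated, for example, from a source-vs-target membership model.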

  • Article type: Journal Article
    We propose a direct domain adaptation (DDA) approach to enrich the training of supervised neural networks on synthetic data with features from real-world data. The process involves a series of linear operations on the input features to the NN model, whether they come from the source or the target distribution, as follows: (1) a cross-correlation of the input data (i.e., images) with a randomly picked sample pixel (or pixels) from the input images, or with the mean of the randomly picked sample pixels over all input images; (2) a convolution of the resulting data with the mean of the autocorrelated input images from the other domain. In the training stage, as expected, the input images come from the source distribution, and the mean of the auto-correlated images is evaluated from the target distribution. In the inference/application stage, the input images come from the target distribution, and the mean of the auto-correlated images is evaluated from the source distribution. The proposed method only manipulates the data from the source and target domains and does not explicitly interfere with the training workflow or the network architecture. An application that trains a convolutional neural network on the MNIST dataset and tests it on the MNIST-M dataset achieves 70% accuracy on the test data. A principal component analysis (PCA), as well as t-SNE, shows that the input features from the source and target domains, after the proposed direct transformations, share similar properties along the principal components compared to the original MNIST and MNIST-M input features.
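    A simplified sketch of the second linear operation, convolving each input image with the mean autocorrelation of images from the other domain, is shown below using FFT-based operations. It is an illustrative reduction of the published DDA pipeline (step (1) is omitted), and the array names are assumptions.

```python
import numpy as np
from scipy.signal import fftconvolve

def mean_autocorrelation(images):
    """Mean 2-D autocorrelation over a stack of images shaped (N, H, W)."""
    acs = [fftconvolve(img, img[::-1, ::-1], mode="same") for img in images]
    return np.mean(acs, axis=0)

def dda_transform(inputs, other_domain_images):
    """Convolve each input image with the mean autocorrelation of the other domain.

    Training:  inputs = source images, other_domain_images = target images.
    Inference: inputs = target images, other_domain_images = source images.
    """
    kernel = mean_autocorrelation(other_domain_images)
    return np.stack([fftconvolve(img, kernel, mode="same") for img in inputs])
```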

  • Article type: Journal Article
    We consider methods for transporting a prediction model for use in a new target population, both when outcome and covariate data for model development are available from a source population that has a different covariate distribution compared with the target population and when covariate data (but not outcome data) are available from the target population. We discuss how to tailor the prediction model to account for differences in the data distribution between the source population and the target population. We also discuss how to assess the model's performance (e.g., by estimating the mean squared prediction error) in the target population. We provide identifiability results for measures of model performance in the target population for a potentially misspecified prediction model under a sampling design where the source and the target population samples are obtained separately. We introduce the concept of prediction error modifiers that can be used to reason about tailoring measures of model performance to the target population. We illustrate the methods in simulated data and apply them to transport a prediction model for lung cancer diagnosis from the National Lung Screening Trial to the nationally representative target population of trial-eligible individuals in the National Health and Nutrition Examination Survey.
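    Estimating a performance measure in the target population from labeled source data plus unlabeled target covariates can again be sketched with inverse-odds weights; the example below computes a weighted mean squared prediction error (an illustrative sketch, not necessarily the article's estimator; the weight construction mirrors the density-ratio sketch shown earlier).

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def transported_mspe(model, X_source, y_source, X_target):
    """Inverse-odds-weighted MSPE of a fitted model under the target covariate distribution."""
    X = np.vstack([X_source, X_target])
    s = np.r_[np.ones(len(X_source)), np.zeros(len(X_target))]   # 1 = source sample
    member = LogisticRegression(max_iter=1000).fit(X, s)
    p_src = member.predict_proba(X_source)[:, 1]
    w = (1 - p_src) / p_src                                      # ~ p_target(x) / p_source(x)
    err2 = (y_source - model.predict(X_source)) ** 2
    return np.sum(w * err2) / np.sum(w)
```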