Generalizability

  • Article type: Journal Article
    Understanding an artificial intelligence (AI) model's ability to generalize to its target population is critical to ensuring the safe and effective usage of AI in medical devices. A traditional generalizability assessment relies on the availability of large, diverse datasets, which are difficult to obtain in many medical imaging applications. We present an approach for enhanced generalizability assessment by examining the decision space beyond the available testing data distribution.
    Vicinal distributions of virtual samples are generated by interpolating between triplets of test images. The generated virtual samples leverage the characteristics already in the test set, increasing the sample diversity while remaining close to the AI model's data manifold. We demonstrate the generalizability assessment approach on the non-clinical tasks of classifying patient sex, race, COVID status, and age group from chest x-rays.
    Decision region composition analysis for generalizability indicated that a disproportionately large portion of the decision space belonged to a single "preferred" class for each task, despite comparable performance on the evaluation dataset. Evaluation using cross-reactivity and population shift strategies indicated a tendency to overpredict samples as belonging to the preferred class (e.g., COVID negative) for patients whose subgroup was not represented in the model development data.
    An analysis of an AI model's decision space has the potential to provide insight into model generalizability. Our approach uses the analysis of composition of the decision space to obtain an improved assessment of model generalizability in the case of limited test data.
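    The triplet interpolation described above can be sketched as a convex combination of three test images. This is only an illustrative sketch, not the authors' implementation: the abstract does not say how interpolation weights are drawn or in which space the images are mixed, so the uniform simplex weights, the pixel-space mixing, and all function names (`sample_simplex_weights`, `interpolate_triplet`, `vicinal_samples`) are assumptions. Images are represented as flattened lists of pixel values to keep the example dependency-free.

```python
import random


def sample_simplex_weights(rng):
    """Draw three non-negative weights that sum to 1 (assumed: uniform on the simplex)."""
    draws = [rng.expovariate(1.0) for _ in range(3)]
    total = sum(draws)
    return [d / total for d in draws]


def interpolate_triplet(img_a, img_b, img_c, weights):
    """Pixel-wise convex combination of three equally sized, flattened images."""
    if not (len(img_a) == len(img_b) == len(img_c)):
        raise ValueError("images must have the same number of pixels")
    wa, wb, wc = weights
    return [wa * a + wb * b + wc * c for a, b, c in zip(img_a, img_b, img_c)]


def vicinal_samples(img_a, img_b, img_c, n_samples, seed=0):
    """Generate virtual samples inside the triangle spanned by three test images."""
    rng = random.Random(seed)
    return [
        interpolate_triplet(img_a, img_b, img_c, sample_simplex_weights(rng))
        for _ in range(n_samples)
    ]
```

    Each virtual sample could then be passed to the classifier, and the share of samples assigned to each class would give a rough picture of how the decision space between the three images is divided.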

  • Article type: Journal Article
    The UK Biobank study contains several sources of diagnostic data, including hospital inpatient data and data on self-reported conditions for approximately 500,000 participants and primary-care data for approximately 177,000 participants (35%). Epidemiologic investigations require a primary disease definition, but whether to combine data sources to maximize statistical power or focus on only 1 source to ensure a consistent outcome is not clear. The consistency of disease definitions was investigated for venous thromboembolism (VTE) by evaluating overlap when defining cases from 3 sources: hospital inpatient data, primary-care reports, and self-reported questionnaires. VTE cases showed little overlap between data sources, with only 6% of reported events for persons with primary-care data being identified by all 3 sources (hospital, primary-care, and self-reports), while 71% appeared in only 1 source. Deep vein thrombosis-only events represented 68% of self-reported VTE cases and 36% of hospital-reported VTE cases, while pulmonary embolism-only events represented 20% of self-reported VTE cases and 50% of hospital-reported VTE cases. Additionally, different distributions of sociodemographic characteristics were observed; for example, patients in 46% of hospital-reported VTE cases were female, compared with 58% of self-reported VTE cases. These results illustrate how seemingly neutral decisions taken to improve data quality can affect the representativeness of a data set.
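    The overlap analysis above amounts to counting, for each case, how many of the three sources identify it. A minimal sketch follows, assuming each source is available as a collection of participant identifiers; the function name `source_overlap` and the toy identifiers in the usage note are hypothetical and are not UK Biobank data.

```python
def source_overlap(hospital, primary_care, self_report):
    """Return the share of cases identified by exactly 1, 2, or 3 sources."""
    sources = [set(hospital), set(primary_care), set(self_report)]
    all_cases = set().union(*sources)
    if not all_cases:
        raise ValueError("no cases in any source")
    counts = {1: 0, 2: 0, 3: 0}
    for case in all_cases:
        # Count how many sources contain this participant identifier.
        counts[sum(case in s for s in sources)] += 1
    return {k: v / len(all_cases) for k, v in counts.items()}
```

    For example, `source_overlap({1, 2, 3}, {3, 4}, {2, 3, 5})` reports that three of the five toy cases appear in only one source, one in two sources, and one in all three.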

  • Article type: Journal Article
    Intention-to-treat comparisons of randomized trials provide asymptotically consistent estimators of the effect of treatment assignment, without regard to compliance. However, decision makers often wish to know the effect of a per-protocol comparison. Moreover, decision makers may also wish to know the effect of treatment assignment or treatment protocol in a user-specified target population other than the sample in which the trial was fielded. Here, we aimed to generalize results from the ACTG A5095 trial to the US recently HIV-diagnosed target population.
    We first replicated the published conventional intention-to-treat estimate (2-year risk difference and hazard ratio) comparing a four-drug antiretroviral regimen to a three-drug regimen in the A5095 trial. We then estimated the intention-to-treat effect that accounted for informative dropout and the per-protocol effect that additionally accounted for protocol deviations by constructing inverse probability weights. Furthermore, we employed inverse odds of sampling weights to generalize both intention-to-treat and per-protocol effects to a target population comprising US individuals with HIV diagnosed during 2008-2014.
    Of 761 subjects in the analysis, 82 dropouts (36 in the three-drug arm and 46 in the four-drug arm) and 59 protocol deviations (25 in the three-drug arm and 34 in the four-drug arm) occurred during the first 2 years of follow-up. A total of 169 subjects incurred virologic failure or death. The 2-year risks were similar both in the trial and in the US HIV-diagnosed target population for estimates from the conventional intention-to-treat, dropout-weighted intention-to-treat, and per-protocol analyses. In the US target population, the 2-year conventional intention-to-treat risk difference (unit: %) for virologic failure or death comparing the four-drug arm to the three-drug arm was -0.4 (95% confidence interval: -6.2, 5.1), while the hazard ratio was 0.97 (95% confidence interval: 0.70, 1.34); the 2-year risk difference was -0.9 (95% confidence interval: -6.9, 5.3) for the dropout-weighted intention-to-treat comparison (hazard ratio = 0.95, 95% confidence interval: 0.68, 1.32) and -0.7 (95% confidence interval: -6.7, 5.5) for the per-protocol comparison (hazard ratio = 0.96, 95% confidence interval: 0.69, 1.34).
    No benefit of four-drug antiretroviral regimen over three-drug regimen was found from the conventional intention-to-treat, dropout-weighted intention-to-treat or per-protocol estimates in the trial sample or target population.
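    The inverse odds of sampling weights mentioned above can be illustrated with a small sketch. It assumes a single categorical covariate (a stratum label) and estimates the probability of trial membership within each stratum from simple counts; the abstract does not describe the model the authors used, so this stratified estimate and the function names `inverse_odds_weights` and `weighted_risk` are assumptions made for illustration only.

```python
from collections import Counter


def inverse_odds_weights(trial_strata, target_strata):
    """Weight each trial participant by the odds of NOT being in the trial.

    Within stratum s, p = n_trial[s] / (n_trial[s] + n_target[s]) is the
    estimated probability of trial membership, and the weight is (1 - p) / p.
    """
    n_trial = Counter(trial_strata)
    n_target = Counter(target_strata)
    weights = []
    for s in trial_strata:
        p_trial = n_trial[s] / (n_trial[s] + n_target[s])
        weights.append((1 - p_trial) / p_trial)
    return weights


def weighted_risk(outcomes, weights):
    """Weighted proportion of events (outcomes coded 0 or 1)."""
    total = sum(weights)
    if total == 0:
        raise ValueError("all weights are zero; no overlap with the target")
    return sum(w * y for w, y in zip(weights, outcomes)) / total
```

    Trial participants from strata that are common in the target population but rare in the trial receive larger weights, so the weighted risk approximates the risk that would be seen if the trial sample resembled the target population on that covariate.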

  • Article type: Journal Article
    This study evaluates the accuracy and portability of a natural language processing (NLP) tool for extracting clinical findings of influenza from clinical notes across two large healthcare systems. Effectiveness is evaluated on how well NLP supports downstream influenza case-detection for disease surveillance.
    We independently developed two NLP parsers, one at Intermountain Healthcare (IH) in Utah and the other at University of Pittsburgh Medical Center (UPMC), using local clinical notes from emergency department (ED) encounters of influenza. We measured NLP parser performance for the presence and absence of 70 clinical findings indicative of influenza. We then developed Bayesian network models from NLP processed reports and tested their ability to discriminate among cases of (1) influenza, (2) non-influenza influenza-like illness (NI-ILI), and (3) 'other' diagnosis.
    On Intermountain Healthcare reports, recall and precision of the IH NLP parser were 0.71 and 0.75, respectively, and of the UPMC NLP parser, 0.67 and 0.79. On University of Pittsburgh Medical Center reports, recall and precision of the UPMC NLP parser were 0.73 and 0.80, respectively, and of the IH NLP parser, 0.53 and 0.80. Bayesian case-detection performance measured by AUROC for influenza versus non-influenza on Intermountain Healthcare cases was 0.93 (using IH NLP parser) and 0.93 (using UPMC NLP parser). Case-detection on University of Pittsburgh Medical Center cases was 0.95 (using UPMC NLP parser) and 0.83 (using IH NLP parser). For influenza versus NI-ILI on Intermountain Healthcare cases, performance was 0.70 (using IH NLP parser) and 0.76 (using UPMC NLP parser). On University of Pittsburgh Medical Center cases, performance was 0.76 (using UPMC NLP parser) and 0.65 (using IH NLP parser).
    In all but one instance (influenza versus NI-ILI using IH cases), local parsers were more effective at supporting case-detection although performances of non-local parsers were reasonable.
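    For readers less familiar with the parser metrics reported above, recall and precision can be computed from counts of true positives, false positives, and false negatives. The sketch below uses the standard definitions; the function name `precision_recall` and the counts in the usage note are hypothetical and are not taken from the study.

```python
def precision_recall(true_pos, false_pos, false_neg):
    """Return (precision, recall) from finding-level counts.

    precision = TP / (TP + FP): share of extracted findings that are correct.
    recall    = TP / (TP + FN): share of true findings that were extracted.
    """
    if true_pos + false_pos == 0 or true_pos + false_neg == 0:
        raise ValueError("precision or recall is undefined for these counts")
    precision = true_pos / (true_pos + false_pos)
    recall = true_pos / (true_pos + false_neg)
    return precision, recall
```

    With hypothetical counts of 8 true positives, 2 false positives, and 4 false negatives, precision is 0.8 and recall is about 0.67.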

  • Article type: Journal Article
    Given increasing concerns about the relevance of research to policy and practice, there is growing interest in assessing and enhancing the external validity of randomized trials: determining how useful a given randomized trial is for informing a policy question for a specific target population.
    This article highlights recent advances in assessing and enhancing external validity, with a focus on the data needed to make ex post statistical adjustments to enhance the applicability of experimental findings to populations potentially different from their study sample.
    We use a case study to illustrate how to generalize treatment effect estimates from a randomized trial sample to a target population, in particular comparing the sample of children in a randomized trial of a supplemental program for Head Start centers (the Research-Based, Developmentally Informed study) to the national population of children eligible for Head Start, as represented in the Head Start Impact Study.
    For this case study, common data elements between the trial sample and population were limited, making reliable generalization from the trial sample to the population challenging.
    To answer important questions about external validity, more publicly available data are needed. In addition, future studies should make an effort to collect measures similar to those in other data sets. Measure comparability between population data sets and randomized trials that use samples of convenience will greatly enhance the range of research and policy relevant questions that can be answered.

  • Article type: Journal Article
    BACKGROUND: Personalizing medical care is becoming increasingly popular, particularly mental health care. There is growing interest in formalizing medical decision making based on evolving patient symptoms in an evidence-based manner. To determine optimal sequencing of treatments, the sequences themselves must be studied; this may be accomplished by using a sequential multiple assignment randomized trial (SMART). It has been hypothesized that SMART studies may improve participant retention and generalizability.
    METHODS: We examine the hypotheses that SMART studies are more generalizable and have better retention than traditional randomized clinical trials via a case study of a SMART study of antipsychotic medications. We considered the Clinical Antipsychotic Trials of Intervention Effectiveness (CATIE) schizophrenia study, comparing the trial participant characteristics and overall retention to those of comparable trials found via a review of all related trials conducted from 2000 onwards.
    RESULTS: A MEDLINE search returned 6435 results for primary screening; ultimately, 48 distinct trials were retained for analysis. The study population in CATIE was similar to, although perhaps less symptomatic than, the study populations of traditional randomized clinical trials (RCTs), suggesting no large gains in generalizability despite the pragmatic nature of the trial. However, CATIE did see good month-by-month retention.
    CONCLUSIONS: SMARTs offer the possibility of studying treatment sequences in a way that a series of traditional RCTs cannot. SMARTs may offer improved retention; however, this case study did not find evidence to suggest greater generalizability using this trial design.
    TRIAL REGISTRATION: ClinicalTrials.gov NCT00014001. Registered on 6 April 2001.

  • Article type: Journal Article
    Policy makers require estimates of comparative effectiveness that apply to the population of interest, but there has been little research on quantitative approaches to assess and extend the generalizability of randomized controlled trial (RCT)-based evaluations. We illustrate an approach using observational data.
    Our example is the Whole Systems Demonstrator (WSD) trial, in which 3230 adults with chronic conditions were assigned to receive telehealth or usual care. First, we used novel placebo tests to assess whether outcomes were similar between the RCT control group and a matched subset of nonparticipants who received usual care. We matched on 65 baseline variables obtained from the electronic medical record. Second, we conducted sensitivity analysis to consider whether the estimates of treatment effectiveness were robust to alternative assumptions about whether "usual care" is defined by the RCT control group or nonparticipants. Thus, we provided alternative estimates of comparative effectiveness by contrasting the outcomes of the RCT telehealth group and matched nonparticipants.
    For some endpoints, such as the number of outpatient attendances, the placebo tests passed, and the effectiveness estimates were robust to the choice of comparison group. However, for other endpoints, such as emergency admissions, the placebo tests failed and the estimates of treatment effect differed markedly according to whether telehealth patients were compared with RCT controls or matched nonparticipants.
    The proposed placebo tests indicate those cases when estimates from RCTs do not generalize to routine clinical practice and motivate complementary estimates of comparative effectiveness that use observational data. Future RCTs are recommended to incorporate these placebo tests and the accompanying sensitivity analyses to enhance their relevance to policy making.