Imputation

  • Article type: Journal Article
    Kidney stone disease is a widespread urological disorder affecting millions globally. Timely diagnosis is crucial to avoid severe complications. Traditionally, renal stones are detected using computed tomography (CT), which, despite its effectiveness, is costly and resource-intensive, exposes patients to unnecessary radiation, and often results in delays due to radiology report wait times. This study presents a novel approach leveraging machine learning to detect renal stones early using routine laboratory test results. We utilized an extensive dataset comprising 2156 patient records from a Saudi Arabian hospital, featuring 15 attributes with challenges such as missing data and class imbalance. We evaluated various machine learning algorithms and imputation methods, including single and multiple imputation, as well as oversampling and undersampling techniques. Our results demonstrate that ensemble tree-based classifiers, specifically random forest (RF) and extra tree classifiers (ETree), outperform others with remarkable accuracy rates of 99%, recall rates of 98%, and F1 scores of 99% for RF, and 92% for ETree. This study underscores the potential of non-invasive, cost-effective laboratory tests for renal stone detection, promoting prompt and improved medical support.

  • Article type: Journal Article
    Missing data is prevalent in the Alzheimer's Disease Neuroimaging Initiative (ADNI). It is common to deal with missingness by removing subjects with missing entries prior to statistical analysis; however, this can lead to significant efficiency loss and sometimes bias. It has yet to be demonstrated that the imputation approach to handling this issue can be valuable in some longitudinal regression settings.
    The purpose of this study is to demonstrate the importance of imputation and how imputation is correctly done in ADNI by analyzing longitudinal Alzheimer's Disease Assessment Scale-Cognitive Subscale 13 (ADAS-Cog 13) scores and their association with baseline patient characteristics.
    We studied 1,063 subjects in ADNI with mild cognitive impairment. Longitudinal ADAS-Cog 13 scores were modeled with a linear mixed-effects model with baseline clinical and demographic characteristics as predictors. The model estimates obtained without imputation were compared with those obtained after imputation with Multiple Imputation by Chained Equations (MICE). We justify application of MICE by investigating the missing data mechanism and model assumptions. We also assess robustness of the results to the choice of imputation method.
    The fixed-effects estimates of the linear mixed-effects model after imputation with MICE yield valid, tighter confidence intervals, thus improving the efficiency of the analysis when compared to the analysis done without imputation.
    Our study demonstrates the importance of accounting for missing data in ADNI. When deciding to perform imputation, care should be taken in choosing the approach, as an invalid one can compromise the statistical analyses.

  • Article type: Journal Article
    Careful handling of missing data is crucial to ensure that clinical prediction models are developed, validated, and implemented in a robust manner. We determined the bias in estimating predictive performance of different combinations of approaches for handling missing data across validation and implementation. We found four strategies that are compatible across the model pipeline and have provided recommendations for handling missing data between model validation and implementation under different missingness mechanisms.

  • Article type: Journal Article
    Background: In clinical prediction modelling, missing data can occur at any stage of the model pipeline: development, validation or deployment. Multiple imputation is often recommended yet challenging to apply at deployment; for example, the outcome cannot be in the imputation model, as recommended under multiple imputation. Regression imputation uses a fitted model to impute the predicted value of missing predictors from observed data, and could offer a pragmatic alternative at deployment. Moreover, the use of missing indicators has been proposed to handle informative missingness, but it is currently unknown how well this method performs in the context of clinical prediction models. Methods: We simulated data under various missing data mechanisms to compare the predictive performance of clinical prediction models developed using both imputation methods. We consider deployment scenarios where missing data is permitted or prohibited, imputation models that use or omit the outcome, and clinical prediction models that include or omit missing indicators. We assume that the missingness mechanism remains constant across the model pipeline. We also apply the proposed strategies to critical care data. Results: With complete data available at deployment, our findings were in line with existing recommendations: that the outcome should be used to impute development data when using multiple imputation and omitted under regression imputation. When missingness is allowed at deployment, omitting the outcome from the imputation model at development was preferred. Missing indicators improved model performance in many cases but can be harmful under outcome-dependent missingness. Conclusion: We provide evidence that commonly taught principles of handling missing data via multiple imputation may not apply to clinical prediction models, particularly when data can be missing at deployment. We observed comparable predictive performance under multiple imputation and regression imputation. The performance of the missing data handling method must be evaluated on a study-by-study basis, and the most appropriate strategy for handling missing data at development should consider whether missing data are allowed at deployment. Some guidance is provided.
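
As a rough illustration of the regression imputation and missing-indicator ideas described in this abstract, the following is a minimal sketch, not the authors' implementation. It assumes a single fully observed predictor `x1` and a partially missing predictor `x2` (both hypothetical names), fits an ordinary least-squares line on the complete cases, and omits the outcome from the imputation model, as the abstract reports for regression imputation.

```python
from typing import List, Optional, Tuple


def fit_simple_regression(x: List[float], y: List[float]) -> Tuple[float, float]:
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


def regression_impute(
    x1: List[float], x2: List[Optional[float]]
) -> Tuple[List[float], List[int]]:
    """Impute missing x2 values from x1 and return a missing indicator.

    The imputation model is fitted on complete cases only and does not
    use the outcome (an assumption matching the regression-imputation
    setting described in the abstract).
    """
    observed = [(a, b) for a, b in zip(x1, x2) if b is not None]
    intercept, slope = fit_simple_regression(
        [a for a, _ in observed], [b for _, b in observed]
    )
    imputed = [
        b if b is not None else intercept + slope * a for a, b in zip(x1, x2)
    ]
    indicator = [1 if b is None else 0 for b in x2]  # 1 = value was missing
    return imputed, indicator


# Hypothetical toy data: x2 is missing for the third record.
x1 = [1.0, 2.0, 3.0, 4.0]
x2 = [2.0, 4.0, None, 8.0]
imputed, indicator = regression_impute(x1, x2)
```

Because the fitted coefficients can be stored, the same imputation model could be reused for a single new record at deployment, which is the pragmatic appeal the abstract points to. The indicator column would enter the prediction model as an extra predictor.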

  • Article type: Journal Article
    The call for personalized medicine highlights the need for personalized (N-of-1) trials to find what treatment works best for individual patients. Conventional (between-subject) randomized controlled trials (RCTs) yield effects for the 'average patient,' but a personalized trial administers all treatments within-subject, so benefits or harms to the individual patient can be identified. The design and analysis of personalized trials involve different strategies from the conventional RCT. These include how to adjust for any carryover effects from one intervention to another, how to handle missing data, and how to provide patients with insight into their data. In addition, a comprehensible report about trial results should be created for each patient and their clinician to facilitate their decision-making. This article describes strategies to address these design and analytic issues, and introduces an R Shiny app that facilitates their solution and explains the use of each of the design and statistical strategies. To illustrate, we also provide a concrete example of a personalized trial series designed to increase activity (i.e., walking steps) in patients with chronic lower back pain (CLBP).

  • Article type: Journal Article
    As optimization methods to identify the best animals for dense genotyping to construct a reference population for genotype imputation, the MCA and MCG methods, which use the pedigree-based additive genetic relationship matrix (A matrix) and the genomic relationship matrix (G matrix), respectively, have been proposed. We assessed the performance of MCA and MCG methods using 575 Japanese Black cows. Pedigree data were provided to trace back up to five generations to construct the A matrix, with the pedigree depth varied from 1 to 5 (five MCA methods). Genotype information on 36,426 single-nucleotide polymorphisms was used to calculate the G matrix based on VanRaden's methods 1 and 2 (two MCG methods). The MCG always selected one cow per iteration, while MCA sometimes selected multiple cows. The number of commonly selected cows between the MCA and MCG methods was generally lower than that between different MCA methods or between different MCG methods. For the studied population, MCG appeared to be more reasonable than MCA in selecting cows as a reference population for higher-density genotype imputation to perform genomic prediction and a genome-wide association study.

  • Article type: Journal Article
    Background: Very low-coverage (0.1 to 1×) whole genome sequencing (WGS) has become a promising and affordable approach to discover genomic variants of human populations for genome-wide association study (GWAS). To support genetic screening using preimplantation genetic testing (PGT) in a large population, the sequencing coverage goes below 0.1× to an ultra-low level. However, the feasibility and effectiveness of ultra-low-coverage WGS (ulcWGS) for GWAS remain undetermined.
    Methods: We built a pipeline to carry out analysis of ulcWGS data for GWAS. To examine its effectiveness, we benchmarked the accuracy of genotype imputation at the combination of different coverages below 0.1× and sample sizes from 2000 to 16,000, using 17,844 embryo PGT samples with approximately 0.04× average coverage and the standard Chinese sample HG005 with known genotypes. We then applied the imputed genotypes of 1744 transferred embryos that have gestational ages and complete follow-up records to GWAS.
    Results: The accuracy of genotype imputation under ultra-low coverage can be improved by increasing the sample size and applying a set of filters. From 1744 born embryos, we identified 11 genomic risk loci associated with gestational ages and 166 genes mapped to these loci according to positional, expression quantitative trait locus, and chromatin interaction strategies. Among these mapped genes, CRHBP, ICAM1, and OXTR were more frequently reported as preterm birth related. By joint analysis of gene expression data from previous studies, we constructed interrelationships of mainly CRHBP, ICAM1, PLAGL1, DNMT1, CNTLN, DKK1, and EGR2 with preterm birth, infant disease, and breast cancer.
    Conclusions: This study not only demonstrates that ulcWGS could achieve relatively high accuracy of adequate genotype imputation and is capable of GWAS, but also provides insights into the associations between gestational age and genetic variations of the fetal embryos from the Chinese population.

  • Article type: Journal Article
    The way missing data in population surveys are treated can influence research results. Therefore, the aim of this paper is to explain the reasons and procedure for imputing anthropometric data such as height and weight self-reported by individuals in the first four waves of the Mexican Health & Aging Study (MHAS; in Spanish, Estudio Nacional de Salud y Envejecimiento en México, ENASEM). We highlight the effect of imputation versus exclusion of the cases with missing data by comparing the distribution of these values and their associated effects on the Body Mass Index using a regression model. We conclude that the incorporation of imputed data offers more solid results compared with eliminating the cases with missing data. Hence the importance of applying these statistical procedures with appropriate treatment of the data, and of making the methodology and the imputed data available to users from the same source of information, as offered in the MHAS.

  • Article type: Journal Article
    Joint genomic prediction (GP) is an attractive method to improve the accuracy of GP by combining information from multiple populations. However, many factors can negatively influence the accuracy of joint GP, such as differences in linkage disequilibrium phasing between single nucleotide polymorphisms (SNPs) and causal variants, minor allele frequencies and causal variants' effect sizes across different populations. The objective of this study was to investigate whether imputed high-density genotype data can improve the accuracy of joint GP using genomic best linear unbiased prediction (GBLUP), single-step GBLUP (ssGBLUP), multi-trait GBLUP (MT-GBLUP) and GBLUP based on a genomic relationship matrix considering heterogeneous minor allele frequencies across different populations (wGBLUP). Three traits, including days taken to reach slaughter weight, backfat thickness and loin muscle area, were measured on 67 276 Large White pigs from two different populations, of which 3334 were genotyped by SNP array. The results showed that a combined population could substantially improve the accuracy of GP compared with a single-population GP, especially for the population with a smaller size. The imputed SNP data had no effect for single-population GP but helped to yield higher accuracy than the medium-density array data for joint GP. Of the four methods, ssGBLUP performed the best, but the advantage of ssGBLUP decreased as more individuals were genotyped. In some cases, MT-GBLUP and wGBLUP performed better than GBLUP. In conclusion, our results confirmed that joint GP could benefit from imputed high-density genotype data, and the wGBLUP and MT-GBLUP methods are promising for joint GP in pig breeding.

  • Article type: Journal Article
    Composite measures, like the Disease Activity Score for 28 joints (DAS28), are key primary outcomes in rheumatoid arthritis (RA) trials. DAS28 combines four different components in a continuous measure. When one or more of these components are missing, the overall composite score is also missing at intermediate or trial endpoint assessments.
    This study examined missing data patterns and mechanisms in a longitudinal RA trial to evaluate how best to handle missingness when analysing composite outcomes.
    The Tumour-Necrosis-Factor Inhibitors against Combination Intensive Therapy (TACIT) trial was an open-label, pragmatic, randomized, multicentre, two-arm non-inferiority study. Patients were followed up for 12 months, with monthly measurement of the composite outcome and its components. Active RA patients were randomized to conventional disease modifying drugs (cDMARDs) or Tumour Necrosis Factor-α inhibitors (TNFis).
    The TACIT trial was used to explore the extent of missing data in the composite outcome, DAS28. Patterns of missing data in components and the composite outcome were examined graphically. Longitudinal multivariable logistic regression analysis assessed missing data mechanisms during follow-up.
    Two hundred and five patients were randomized: at 12 months, 59/205 (29%) had an unobserved composite outcome and 146/205 (71%) had an observed DAS28 outcome; however, 34/146 had one or more intermediate assessments missing. We observed mixed missing data patterns, especially for the composite outcome missing because one component was missing rather than because the patient did not attend their visit. Age and gender predicted missingness of components, providing strong evidence that the missing observations were unlikely to be Missing Completely at Random (MCAR).
    Researchers should undertake detailed evaluations of missing data patterns and mechanisms at the final and intermediate time points, whether or not the outcome variable is a composite outcome. In addition, the impact on treatment estimates in patients who only provide data at milestone assessments needs to be assessed.
    37438295.
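
The point that a single missing component makes the whole composite score missing can be sketched in a few lines. This is an illustrative example only: it assumes the widely cited DAS28-ESR formula (tender joint count, swollen joint count, erythrocyte sedimentation rate, patient global health), which may not be the exact variant used in the TACIT trial, and the function and argument names are hypothetical.

```python
import math
from typing import Optional


def das28_esr(
    tender_joints: Optional[float],
    swollen_joints: Optional[float],
    esr_mm_per_hr: Optional[float],
    global_health_0_100: Optional[float],
) -> Optional[float]:
    """Return the DAS28-ESR score, or None if any component is missing.

    Assumed formula (commonly cited DAS28-ESR):
    0.56*sqrt(TJC28) + 0.28*sqrt(SJC28) + 0.70*ln(ESR) + 0.014*GH
    """
    components = (tender_joints, swollen_joints, esr_mm_per_hr, global_health_0_100)
    if any(c is None for c in components):
        # One missing component propagates to the composite outcome.
        return None
    return (
        0.56 * math.sqrt(tender_joints)
        + 0.28 * math.sqrt(swollen_joints)
        + 0.70 * math.log(esr_mm_per_hr)
        + 0.014 * global_health_0_100
    )


# Hypothetical visits: one complete, one with the ESR value missing.
complete_visit = das28_esr(4, 1, 20, 50)
incomplete_visit = das28_esr(4, 1, None, 50)
```

Tracking which component triggered the `None` (rather than only the composite) is what allows the component-level missingness patterns described in the abstract to be examined.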