Imputation

  • Article type: Journal Article
    Kidney stone disease is a widespread urological disorder affecting millions globally. Timely diagnosis is crucial to avoid severe complications. Traditionally, renal stones are detected using computed tomography (CT), which, despite its effectiveness, is costly and resource-intensive, exposes patients to unnecessary radiation, and often results in delays due to radiology report wait times. This study presents a novel approach leveraging machine learning to detect renal stones early using routine laboratory test results. We used an extensive dataset comprising 2156 patient records from a Saudi Arabian hospital, featuring 15 attributes and posing challenges such as missing data and class imbalance. We evaluated various machine learning algorithms and imputation methods, including single and multiple imputation, as well as oversampling and undersampling techniques. Our results demonstrate that ensemble tree-based classifiers, specifically random forest (RF) and extra tree classifiers (ETree), outperform others with remarkable accuracy rates of 99%, recall rates of 98%, and F1 scores of 99% for RF and 92% for ETree. This study underscores the potential of non-invasive, cost-effective laboratory tests for renal stone detection, promoting prompt and improved medical support.

  • Article type: Journal Article
    This manuscript develops several efficient difference- and ratio-type imputations to handle missing observations when those observations are contaminated by measurement errors (ME). The mean square errors of the developed imputations are studied to the first order of approximation using Taylor series expansion. The proposed imputations are compared with the most recent imputations presented in the literature. The performance of the proposed imputations is assessed through a broad empirical study using some real and hypothetically created populations. Appropriate remarks are made for sampling respondents regarding practical applications.
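For orientation, the classical ratio and difference imputations that proposals of this kind build on can be sketched as below. This is not the manuscript's measurement-error-adjusted method: the function names, the `slope` constant, and the toy data are hypothetical, and the sketch assumes an auxiliary variable x that is observed for every unit.

```python
from statistics import mean

def ratio_impute(y, x):
    """Fill missing y_i (None) with (ybar_r / xbar_r) * x_i, using respondents only."""
    resp = [(yi, xi) for yi, xi in zip(y, x) if yi is not None]
    ratio = mean(yi for yi, _ in resp) / mean(xi for _, xi in resp)
    return [yi if yi is not None else ratio * xi for yi, xi in zip(y, x)]

def difference_impute(y, x, slope=1.0):
    """Fill missing y_i (None) with ybar_r + slope * (x_i - xbar_r).

    `slope` is a user-chosen constant (an assumption of this sketch).
    """
    resp = [(yi, xi) for yi, xi in zip(y, x) if yi is not None]
    ybar_r = mean(yi for yi, _ in resp)
    xbar_r = mean(xi for _, xi in resp)
    return [yi if yi is not None else ybar_r + slope * (xi - xbar_r)
            for yi, xi in zip(y, x)]

# Toy data: two respondents, two nonrespondents.
x = [10.0, 20.0, 30.0, 40.0]
y = [5.0, None, 15.0, None]

y_ratio = ratio_impute(y, x)
y_diff = difference_impute(y, x, slope=1.0)
print(y_ratio)
print(y_diff)
```

Both functions leave observed values untouched and differ only in how the auxiliary variable enters the imputed value: multiplicatively for the ratio form, additively for the difference form.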

  • Article type: Journal Article
    The MCA and MCG methods, which use the pedigree-based additive genetic relationship matrix (A matrix) and the genomic relationship matrix (G matrix), respectively, have been proposed as optimization methods to identify the best animals for dense genotyping when constructing a reference population for genotype imputation. We assessed the performance of the MCA and MCG methods using 575 Japanese Black cows. Pedigree data were traced back up to five generations to construct the A matrix, with the pedigree depth varied from 1 to 5 (five MCA methods). Genotype information on 36,426 single-nucleotide polymorphisms was used to calculate the G matrix based on VanRaden's methods 1 and 2 (two MCG methods). The MCG methods always selected one cow per iteration, while the MCA methods sometimes selected multiple cows. The number of cows selected in common between the MCA and MCG methods was generally lower than that between different MCA methods or between different MCG methods. For the studied population, MCG appeared to be more reasonable than MCA for selecting cows as a reference population for higher-density genotype imputation to perform genomic prediction and a genome-wide association study.
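As a rough illustration of the G matrix mentioned above, VanRaden's method 1 can be written as G = ZZ' / (2 Σ p_j(1 − p_j)), where Z holds allele counts centered by twice the allele frequency. The sketch below assumes genotypes coded as 0/1/2 copies of the counted allele and estimates allele frequencies from the sample itself. The function name and the toy genotypes are hypothetical, and the MCA/MCG selection step is not reproduced here.

```python
def vanraden_g1(genotypes):
    """Genomic relationship matrix by VanRaden's method 1.

    genotypes: one list per animal of allele counts (0, 1 or 2) per SNP.
    Allele frequencies are estimated from the animals supplied (an assumption
    of this sketch; base-population frequencies could be used instead).
    """
    n_animals = len(genotypes)
    n_snps = len(genotypes[0])
    freqs = [sum(row[j] for row in genotypes) / (2 * n_animals)
             for j in range(n_snps)]
    scale = 2 * sum(p * (1 - p) for p in freqs)
    # Center each allele count by its expected value 2p.
    z = [[row[j] - 2 * freqs[j] for j in range(n_snps)] for row in genotypes]
    return [[sum(a * b for a, b in zip(z[i], z[k])) / scale
             for k in range(n_animals)]
            for i in range(n_animals)]

# Toy example: 3 animals, 4 SNPs.
toy_genotypes = [
    [0, 1, 2, 1],
    [1, 1, 0, 2],
    [2, 0, 1, 1],
]
G = vanraden_g1(toy_genotypes)
for row in G:
    print([round(v, 3) for v in row])
```

Because the columns of Z are centered on sample frequencies, each row of G sums to zero in this sketch. With only three animals the values are not meaningful estimates of relationship.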

  • Article type: Journal Article
    Joint genomic prediction (GP) is an attractive method to improve the accuracy of GP by combining information from multiple populations. However, many factors can negatively influence the accuracy of joint GP, such as differences across populations in linkage disequilibrium phase between single nucleotide polymorphisms (SNPs) and causal variants, in minor allele frequencies, and in causal variants' effect sizes. The objective of this study was to investigate whether imputed high-density genotype data can improve the accuracy of joint GP using genomic best linear unbiased prediction (GBLUP), single-step GBLUP (ssGBLUP), multi-trait GBLUP (MT-GBLUP) and GBLUP based on a genomic relationship matrix that considers heterogeneous minor allele frequencies across different populations (wGBLUP). Three traits, including days taken to reach slaughter weight, backfat thickness and loin muscle area, were measured on 67,276 Large White pigs from two different populations, of which 3334 were genotyped by SNP array. The results showed that a combined population could substantially improve the accuracy of GP compared with a single-population GP, especially for the population with a smaller size. The imputed SNP data had no effect on single-population GP but helped to yield higher accuracy than the medium-density array data for joint GP. Of the four methods, ssGBLUP performed the best, but the advantage of ssGBLUP decreased as more individuals were genotyped. In some cases, MT-GBLUP and wGBLUP performed better than GBLUP. In conclusion, our results confirmed that joint GP could benefit from imputed high-density genotype data, and the wGBLUP and MT-GBLUP methods are promising for joint GP in pig breeding.

  • Article type: Journal Article
    Over recent years, progress in data collection in the life sciences has created increasing demand and opportunities for advanced bioinformatics. This includes data management as well as individual data analysis, and often covers the entire data life cycle. A variety of tools have been developed to store, share, or reuse the data produced in different domains such as genotyping. Imputation in particular, as a subfield of genotyping, requires good Research Data Management (RDM) strategies to enable the use and re-use of genotypic data. To achieve sustainable software, it is necessary to develop tools and surrounding ecosystems that are reusable and maintainable. Reusability in the context of streamlined tools can be achieved, for example, by standardizing the input and output of the different tools and adapting to open and broadly used file formats. By using such established file formats, the tools can also be connected with others, improving the overall interoperability of the software. Finally, it is important to build strong communities that maintain the tools by developing and contributing new features and maintenance updates. In this article, concepts for this are presented for an imputation service.

  • Article type: Journal Article
    An increasing number of large-scale multi-modal research initiatives have been conducted in the typically developing population, e.g. Dev. Cogn. Neur. 32:43-54, 2018; PLoS Med. 12(3):e1001779, 2015; Elam and Van Essen, Enc. Comp. Neur., 2013, as well as in psychiatric cohorts, e.g. Trans. Psych. 10(1):100, 2020; Mol. Psych. 19:659-667, 2014; Mol. Aut. 8:24, 2017; Eur. Child and Adol. Psych. 24(3):265-281, 2015. Missing data is a common problem in such datasets due to the difficulty of assessing multiple measures on a large number of participants. The consequences of missing data accumulate when researchers aim to integrate relationships across multiple measures. Here we aim to evaluate different imputation strategies to fill in missing values in clinical data from a large (total N = 764) and deeply phenotyped (i.e. range of clinical and cognitive instruments administered) sample of N = 453 autistic individuals and N = 311 control individuals recruited as part of the EU-AIMS Longitudinal European Autism Project (LEAP) consortium. In particular, we consider a total of 160 clinical measures divided into 15 overlapping subsets of participants. We use two simple but common univariate strategies, mean and median imputation, as well as a Round Robin regression approach involving four independent multivariate regression models: Bayesian Ridge regression and several non-linear models, namely Decision Trees, Extra Trees, and Nearest Neighbours regression. We evaluate the models using the traditional mean square error on available data that were deliberately removed, and also consider the Kullback-Leibler divergence between the observed and the imputed distributions. We show that all of the multivariate approaches tested provide a substantial improvement compared to typical univariate approaches. Further, our analyses reveal that across all 15 data subsets tested, an Extra Trees regression approach provided the best global results. This not only allows the selection of a unique model to impute missing data for the LEAP project and delivers a fixed set of imputed clinical data to be used by researchers working with the LEAP dataset in the future, but also provides more general guidelines for data imputation in large-scale epidemiological studies.
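The evaluation protocol described above (hide values that are actually available, impute them, then score the imputations) can be sketched for the two univariate baselines as follows. This is only a minimal illustration under stated assumptions: the data are synthetic, the mask is a fixed pattern chosen to keep the sketch deterministic, the helper names are hypothetical, and the Kullback-Leibler divergence is computed on simple smoothed histograms. The Round Robin regression models from the study are not reproduced.

```python
import math
from bisect import bisect_right
from statistics import mean, median

def masked_mse(values, fill, hidden_idx):
    """Mean square error of a constant imputation on deliberately hidden entries."""
    return mean((values[i] - fill) ** 2 for i in hidden_idx)

def histogram(sample, edges, eps=1e-9):
    """Smoothed relative frequencies of `sample` over the bins given by `edges`."""
    counts = [0] * (len(edges) - 1)
    for v in sample:
        if edges[0] <= v <= edges[-1]:
            counts[min(bisect_right(edges, v) - 1, len(counts) - 1)] += 1
    total = sum(counts) + eps * len(counts)
    return [(c + eps) / total for c in counts]

def kl_divergence(p_sample, q_sample, edges):
    """KL(P || Q) between the binned distributions of two samples."""
    p, q = histogram(p_sample, edges), histogram(q_sample, edges)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

# Synthetic, right-skewed "clinical measure" (hypothetical data).
observed = [(i / 10.0) ** 2 for i in range(1, 101)]
hidden_idx = list(range(0, len(observed), 5))  # fixed mask; normally random
hidden = set(hidden_idx)
kept = [v for i, v in enumerate(observed) if i not in hidden]
truth = [observed[i] for i in hidden_idx]

lo, hi, bins = min(observed), max(observed), 10
edges = [lo + k * (hi - lo) / bins for k in range(bins)] + [hi]

results = {}
for name, fill in (("mean", mean(kept)), ("median", median(kept))):
    imputed = [fill] * len(hidden_idx)
    results[name] = (masked_mse(observed, fill, hidden_idx),
                     kl_divergence(truth, imputed, edges))
    print(name, results[name])
```

A constant fill concentrates all imputed values in one bin, so its divergence from the true distribution is large even when its mean square error looks acceptable, which is one reason to report both scores.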

  • Article type: Journal Article
    We consider analyses of case-control studies assembled from electronic health records (EHRs) where the pool of cases is contaminated by patients who are ineligible for the study. These ineligible patients, referred to as "false cases," should be excluded from the analyses if known. However, the true outcome status of a patient in the case pool is unknown except in a subset whose size may be arbitrarily small compared to the entire pool. To effectively remove the influence of the false cases on estimating odds ratio parameters defined by a working association model of the logistic form, we propose a general strategy to adaptively impute the unknown case status without requiring a correct phenotyping model to help discern the true and false case statuses. Our method estimates the target parameters as the solution to a set of unbiased estimating equations constructed using all available data. It outperforms existing methods by achieving robustness to mismodeling the relationship between the outcome status and covariates of interest, as well as improved estimation efficiency. We further show that our estimator is root-n-consistent and asymptotically normal. Through extensive simulation studies and analysis of real EHR data, we demonstrate that our method has desirable robustness to possible misspecification of both the association and phenotyping models, along with statistical efficiency superior to the competitors.

  • Article type: Journal Article
    Although imputation of missing SNP results has been widely used in genetic studies, claims about the quality and usefulness of imputation have outnumbered the few studies that have questioned its limitations. But it is becoming clear that these limitations are real: for example, disease association signals can be missed in regions of LD breakdown. Here, as a case study, using the chromosomal region of the well-known lactase gene, LCT, we address the issue of imputation in the context of variants that have become frequent in a limited number of modern population groups only recently, due to selection. We study SNPs in a 500 bp region covering the enhancer of LCT, and compare imputed genotypes with directly genotyped data. We examine the haplotype pairs of all individuals with discrepant and missing genotypes. We highlight the nonrandom nature of the allelic errors and show that most incorrect imputations and missing data result from long haplotypes that are evolutionarily closely related to those carrying the derived alleles, while some relate to rare and recombinant haplotypes. We conclude that bias of incorrectly imputed and missing genotypes can decrease the accuracy of imputed results substantially.

  • Article type: Journal Article
    Killer cell immunoglobulin-like receptors (KIR) regulate immune responses in NK and CD8+ T cells via interaction with HLA ligands. KIR genes, including KIR2DS1, KIR3DL1, and KIR3DS1, have previously been implicated in psoriasis susceptibility. However, these previous studies were constrained to small sample sizes, in part due to the time and expense required for direct genotyping of KIR genes. Here, we implemented KIR*IMP to impute KIR copy number from single-nucleotide polymorphisms (SNPs) on chromosome 19 in the discovery cohort (n=11,912) from the PAGE consortium, University of California San Francisco, and the University of Dundee, and in a replication cohort (n=66,357) from Kaiser Permanente Northern California. Stratified multivariate logistic regression that accounted for patient ancestry and high-risk HLA alleles revealed that KIR2DL2 copy number was significantly associated with psoriasis in the discovery cohort (p ≤ 0.05). The KIR2DL2 copy number association was replicated in the Kaiser Permanente replication cohort. This is the first reported association of KIR2DL2 copy number with psoriasis and highlights the importance of KIR genetics in the pathogenesis of psoriasis.

  • Article type: Journal Article
    Minimizing bias in randomized controlled trials (RCTs) includes intention-to-treat analyses. Hospice/palliative care RCTs are constrained by high attrition that is unpredictable at the time of consent, including withdrawals between randomization and first exposure to the intervention. If such withdrawals are considered nonresponders, they may systematically bias findings away from the new intervention being evaluated.
    This study aimed to quantify the impact of such withdrawals within intention-to-treat principles.
    A theoretical model was developed to assess the impact of withdrawals between randomization and first exposure on study power and effect sizes. Ten reported hospice/palliative care studies had power recalculated accounting for such withdrawal.
    In the theoretical model, when 5% of withdrawals occurred between randomization and first exposure to the intervention, change in power was demonstrated in binary outcomes (2.0%-2.2%), continuous outcomes (0.8%-2.0%), and time-to-event outcomes (1.6%-2.0%), and odds ratios were changed by 0.06-0.17. Greater power loss was observed with larger effect sizes. Withdrawal rates were 0.9%-10% in the 10 reported RCTs, corresponding to power losses of 0.1%-2.2%. For studies with binary outcomes, withdrawal rates were 0.3%-1.2%, changing odds ratios by 0.01-0.22.
    If blinding is maintained and all interventions are available simultaneously, our model suggests that excluding data from withdrawals between randomization and first exposure to the intervention minimizes one bias. This is the safety population as defined by the International Committee on Harmonization. When planning for future trials, minimizing the time between randomization and first exposure to the intervention will minimize the problem. Power should be calculated on people who receive the intervention.
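To make the power argument concrete, the sketch below compares planned power for a binary outcome with two ways of handling withdrawals that occur before first exposure. It is not the authors' theoretical model, which the abstract does not specify. It assumes a two-sided z-test for two proportions under the usual normal approximation, equal withdrawal in both arms, and hypothetical planning values. The function name is also an assumption.

```python
from math import sqrt
from statistics import NormalDist

def power_two_proportions(p_control, p_treat, n_per_arm, alpha=0.05):
    """Approximate power of a two-sided z-test comparing two proportions."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    p_bar = (p_control + p_treat) / 2
    se_null = sqrt(2 * p_bar * (1 - p_bar))
    se_alt = sqrt(p_control * (1 - p_control) + p_treat * (1 - p_treat))
    effect = abs(p_treat - p_control) * sqrt(n_per_arm)
    return NormalDist().cdf((effect - z_alpha * se_null) / se_alt)

# Hypothetical planning values (not taken from the study).
p_control, p_treat, n_per_arm, withdrawal = 0.30, 0.50, 100, 0.05

power_planned = power_two_proportions(p_control, p_treat, n_per_arm)

# Assumption A: withdrawals stay in the analysis as nonresponders,
# so both response rates shrink by the withdrawal fraction.
power_nonresponder = power_two_proportions(
    p_control * (1 - withdrawal), p_treat * (1 - withdrawal), n_per_arm)

# Assumption B: withdrawals are excluded, so only the sample size shrinks.
power_excluded = power_two_proportions(
    p_control, p_treat, n_per_arm * (1 - withdrawal))

print(f"planned:              {power_planned:.3f}")
print(f"counted nonresponder: {power_nonresponder:.3f}")
print(f"excluded:             {power_excluded:.3f}")
```

Under these assumptions both handling choices reduce power relative to the plan, which is consistent with the recommendation to calculate power on the people expected to receive the intervention.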