Imputation

  • Article type: Journal Article
    Kidney stone disease is a widespread urological disorder affecting millions globally. Timely diagnosis is crucial to avoid severe complications. Traditionally, renal stones are detected using computed tomography (CT), which, despite its effectiveness, is costly, resource-intensive, exposes patients to unnecessary radiation, and often results in delays due to radiology report wait times. This study presents a novel approach leveraging machine learning to detect renal stones early using routine laboratory test results. We utilized an extensive dataset comprising 2156 patient records from a Saudi Arabian hospital, featuring 15 attributes with challenges such as missing data and class imbalance. We evaluated various machine learning algorithms and imputation methods, including single and multiple imputations, as well as oversampling and undersampling techniques. Our results demonstrate that ensemble tree-based classifiers, specifically random forest (RF) and extra tree classifiers (ETree), outperform others with remarkable accuracy rates of 99%, recall rates of 98%, and F1 scores of 99% for RF, and 92% for ETree. This study underscores the potential of non-invasive, cost-effective laboratory tests for renal stone detection, promoting prompt and improved medical support.
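As an illustration of the kind of pipeline this abstract describes, the following sketch combines single (median) imputation with a class-weighted random forest on synthetic data. The dataset, the 15-attribute layout, and the use of class weighting in place of an explicit over/undersampling step are all assumptions for the sake of a runnable example, not the paper's actual data or configuration.

```python
# Hedged sketch: single imputation + random forest on imbalanced data
# with missing values (synthetic stand-in for routine lab-test records).
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 15))               # 15 lab-test attributes
y = (X[:, 0] + X[:, 1] > 0.5).astype(int)    # imbalanced binary label
X[rng.random(X.shape) < 0.1] = np.nan        # ~10% missing entries

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0, stratify=y)

imputer = SimpleImputer(strategy="median")   # single imputation
clf = RandomForestClassifier(n_estimators=200,
                             class_weight="balanced",  # imbalance handling
                             random_state=0)
clf.fit(imputer.fit_transform(X_tr), y_tr)
pred = clf.predict(imputer.transform(X_te))
print(round(f1_score(y_te, pred), 2))
```

In a real study, multiple imputation and resampling variants (as compared in the paper) would be evaluated against this kind of single-imputation baseline.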

  • Article type: Journal Article
    Current methodologies of genome-wide single-nucleotide polymorphism (SNP) genotyping produce large amounts of missing data that may affect statistical inference and bias the outcome of experiments. Genotype imputation is routinely used in well-studied species to buffer the impact on downstream analysis, and several algorithms are available to fill in missing genotypes. The lack of reference haplotype panels precludes the use of these methods in genomic studies on non-model organisms. As an alternative, machine learning algorithms are employed to explore the genotype data and to estimate the missing genotypes. Here, we propose an imputation method based on self-organizing maps (SOM), a widely used type of neural network formed by spatially distributed neurons that cluster similar inputs into nearby neurons. The method explores genotype datasets to select SNP loci to build binary vectors from the genotypes, and initializes and trains a neural network for each query missing SNP genotype. The SOM-derived clustering is then used to impute the best genotype. To automate the imputation process, we have implemented gtImputation, an open-source application programmed in Python3 with a user-friendly GUI to facilitate the whole process. The method's performance was validated by comparing its accuracy, precision, and sensitivity with those of other available imputation algorithms on several benchmark genotype datasets. Our approach produced highly accurate and precise genotype imputations even for SNPs with alleles at low frequency and outperformed other algorithms, especially for datasets from mixed populations with unrelated individuals.
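A toy version of the idea can be sketched in a few lines: train a small self-organizing map on complete binary genotype vectors, then fill a missing entry from the winning neuron's weight vector, matching the query on its observed SNPs only. This is a deliberate simplification and not the gtImputation implementation; grid size, iteration count, and neighbourhood function are arbitrary choices.

```python
# Toy SOM-based genotype imputation sketch (not gtImputation itself).
import numpy as np

rng = np.random.default_rng(1)
G = (rng.random((40, 8)) > 0.5).astype(float)  # 40 samples x 8 SNPs, coded 0/1
G[0, 3] = np.nan                               # one missing genotype to impute

som = rng.random((3, 3, 8))                    # 3x3 neuron grid, 8-dim weights
complete = G[~np.isnan(G).any(axis=1)]         # training set: complete samples

for t in range(200):                           # stochastic SOM training
    x = complete[rng.integers(len(complete))]
    d = ((som - x) ** 2).sum(axis=2)
    bi, bj = np.unravel_index(d.argmin(), d.shape)
    lr = 0.5 * (1 - t / 200)                   # decaying learning rate
    for i in range(3):                         # Gaussian neighbourhood update
        for j in range(3):
            h = np.exp(-((i - bi) ** 2 + (j - bj) ** 2) / 2.0)
            som[i, j] += lr * h * (x - som[i, j])

obs = ~np.isnan(G[0])                          # match query on observed SNPs only
d = ((som[:, :, obs] - G[0, obs]) ** 2).sum(axis=2)
bi, bj = np.unravel_index(d.argmin(), d.shape)
imputed = round(som[bi, bj, 3])                # snap weight back to a 0/1 genotype
print(imputed)
```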

  • Article type: Journal Article
    The landscape of survival analysis is constantly being revolutionized to answer biomedical challenges, most recently the statistical challenge of censored covariates rather than outcomes. There are many promising strategies to tackle censored covariates, including weighting, imputation, maximum likelihood, and Bayesian methods. Still, this is a relatively fresh area of research, different from the areas of censored outcomes (i.e., survival analysis) or missing covariates. In this review, we discuss the unique statistical challenges encountered when handling censored covariates and provide an in-depth review of existing methods designed to address those challenges. We emphasize each method's relative strengths and weaknesses, providing recommendations to help investigators pinpoint the best approach to handling censored covariates in their data.
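To make the imputation strategy mentioned above concrete, here is a minimal sketch of conditional-mean imputation for a right-censored covariate under an assumed exponential model: by memorylessness, E[X | X > c] = c + 1/lambda, so each censored value is replaced by its censoring time plus the distribution's mean. The exponential assumption and known rate are illustrative simplifications, not a recommendation from the review.

```python
# Conditional-mean imputation of a right-censored covariate,
# assuming (for illustration) X ~ Exponential(rate=lam) with known lam.
import numpy as np

rng = np.random.default_rng(2)
lam = 0.5
x = rng.exponential(1 / lam, size=1000)      # true covariate values
c = rng.exponential(1 / lam, size=1000)      # censoring times
observed = np.minimum(x, c)                  # what the analyst sees
censored = x > c                             # True where only a lower bound is seen

x_imp = observed.copy()
x_imp[censored] = observed[censored] + 1 / lam  # E[X | X > c] = c + 1/lam

print(round(x_imp.mean(), 2), round(x.mean(), 2))
```

Because the conditional mean is correct under this model, the imputed sample mean tracks the true covariate mean; with an unknown or misspecified distribution, the likelihood-based and Bayesian methods discussed in the review become relevant.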

  • Article type: Journal Article
    BACKGROUND: Missing data is a common challenge in mass spectrometry-based metabolomics, which can lead to biased and incomplete analyses. The integration of whole-genome sequencing (WGS) data with metabolomics data has emerged as a promising approach to enhance the accuracy of data imputation in metabolomics studies.
    METHODS: In this study, we propose a novel method that leverages the information from WGS data and reference metabolites to impute unknown metabolites. Our approach utilizes a multi-scale variational autoencoder to jointly model the burden score, polygenetic risk score (PGS), and linkage disequilibrium (LD) pruned single nucleotide polymorphisms (SNPs) for feature extraction and missing metabolomics data imputation. By learning the latent representations of both omics data, our method can effectively impute missing metabolomics values based on genomic information.
    RESULTS: We evaluate the performance of our method on empirical metabolomics datasets with missing values and demonstrate its superiority compared to conventional imputation techniques. Using burden scores derived from 35 template metabolites, PGS, and LD-pruned SNPs, the proposed method achieved R2 scores > 0.01 for 71.55% of metabolites.
    CONCLUSIONS: The integration of WGS data in metabolomics imputation not only improves data completeness but also enhances downstream analyses, paving the way for more comprehensive and accurate investigations of metabolic pathways and disease associations. Our findings offer valuable insights into the potential benefits of utilizing WGS data for metabolomics data imputation and underscore the importance of leveraging multi-modal data integration in precision medicine research.

  • Article type: Journal Article
    Human leukocyte antigen (HLA) molecules and their relationships with natural killer (NK) cells, specifically through their interaction with killer-cell immunoglobulin-like receptors (KIRs), exhibit robust associations with the outcomes of diverse diseases. Moreover, genetic variations in HLA and KIR immune system genes offer limitless depths of complexity. In recent years, a surge of high-powered genome-wide association studies (GWASs) utilizing single nucleotide polymorphism (SNP) arrays has occurred, significantly advancing our understanding of disease pathogenesis. Additionally, advances in HLA reference panels have enabled higher resolution and more reliable imputation, allowing for finer-grained evaluation of the association between sequence variations and disease risk. However, it is essential to note that the majority of these GWASs have focused primarily on populations of Caucasian and Asian origins, neglecting underrepresented populations in Latin America and Africa. This omission not only leads to disparities in health care access but also restricts our knowledge of novel genetic variants involved in disease pathogenesis within these overlooked populations. Since the KIR and HLA haplotypes prevalent in each population are clearly modelled by the specific environment, the aim of this review is to encourage studies investigating HLA/KIR involvement in infection and autoimmune diseases, reproduction, and transplantation in underrepresented populations.

  • Article type: Journal Article
    SNP-based imputation approaches for human leukocyte antigen (HLA) typing take advantage of the haplotype structure within the major histocompatibility complex (MHC) region. These methods predict HLA classical alleles using dense SNP genotypes, commonly found on array-based platforms used in genome-wide association studies (GWAS). The analysis of HLA classical alleles can be conducted on current SNP datasets at no additional cost. Here, we describe the workflow of HIBAG, an imputation method with attribute bagging, to infer a sample's HLA classical alleles using SNP data. Two examples are offered to demonstrate the functionality using public HLA and SNP data from the latest release of the 1000 Genomes project: genotype imputation using pre-built classifiers in a GWAS, and model training to create a new prediction model. The GPU implementation facilitates model building, making it hundreds of times faster compared to the single-threaded implementation.

  • Article type: Journal Article
    DNA methylation is an important epigenetic mark that modulates gene expression through the inhibition of transcriptional proteins binding to DNA. As in many other omics experiments, missing values are a significant issue, and appropriate imputation techniques are essential to avoid an unnecessary reduction in sample size and to optimally leverage the information collected. We consider the case where relatively few samples are processed via an expensive high-density whole genome bisulfite sequencing (WGBS) strategy and a larger number of samples is processed using more affordable low-density, array-based technologies. In such cases, one can impute the low-coverage (array-based) methylation data using the high-density information provided by the WGBS samples. In this paper, we propose an efficient Linear Model of Coregionalisation with informative Covariates (LMCC) to predict missing values based on observed values and covariates. Our model assumes that at each site, the methylation vector of all samples is linked to a set of fixed factors (covariates) and a set of latent factors. Furthermore, we exploit the functional nature of the data and the spatial correlation across sites by assuming Gaussian processes on the fixed and latent coefficient vectors, respectively. Our simulations show that the use of covariates can significantly improve the accuracy of imputed values, especially in cases where the missing data contain some relevant information about the explanatory variable. We also show that our proposed model is particularly efficient when the number of columns is much greater than the number of rows, which is usually the case in methylation data analysis. Finally, we apply and compare our proposed method with alternative approaches on two real methylation datasets, showing how covariates such as cell type, tissue type, or age can enhance the accuracy of imputed values.
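The core intuition of covariate-informed imputation can be shown without the full LMCC machinery: regress a methylation site on an informative covariate such as age using the complete samples, then predict the missing entries. The data, the single-covariate linear model, and the effect sizes below are assumptions for illustration only.

```python
# Minimal covariate-aided imputation sketch (not the LMCC model):
# fit methylation ~ age on observed samples, predict the missing ones.
import numpy as np

rng = np.random.default_rng(3)
n = 200
age = rng.uniform(20, 70, n)                        # informative covariate
meth = 0.3 + 0.005 * age + rng.normal(0, 0.02, n)   # methylation fraction
miss = rng.random(n) < 0.25                         # 25% of samples missing

A = np.column_stack([np.ones(n), age])              # design matrix
coef, *_ = np.linalg.lstsq(A[~miss], meth[~miss], rcond=None)
meth_imp = meth.copy()
meth_imp[miss] = A[miss] @ coef                     # covariate-based prediction

err = np.abs(meth_imp[miss] - meth[miss]).mean()
print(round(err, 3))
```

The LMCC goes further by sharing information across neighbouring sites through Gaussian processes on the coefficient vectors, rather than fitting each site independently as here.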

  • Article type: Journal Article
    Low-coverage whole-genome sequencing (LCS) offers a cost-effective alternative for sturgeon breeding, especially given the lack of SNP chips and the high costs associated with whole-genome sequencing. In this study, the efficiency of LCS for genotype imputation and genomic prediction was assessed in 643 sequenced Russian sturgeons (∼13.68×). The results showed that using BaseVar+STITCH at a sequencing depth of 2× with a sample size larger than 300 resulted in the highest genotyping accuracy. In addition, when the sequencing depth reached 0.5× and SNP density was reduced to 50 K through linkage disequilibrium pruning, the prediction accuracy was comparable to that of whole sequencing depth. Furthermore, an incremental feature selection method has the potential to improve prediction accuracy. This study suggests that the combination of LCS and imputation can be a cost-effective strategy, contributing to the genetic improvement of economic traits and promoting genetic gains in aquaculture species.
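The linkage disequilibrium pruning step mentioned above can be sketched as a greedy pass that drops any SNP whose squared correlation (r2) with an already-kept SNP exceeds a threshold. The 0.8 threshold, the toy genotype matrix, and the deliberately duplicated SNP are assumptions, not the study's actual pruning parameters.

```python
# Toy greedy LD-pruning sketch on a small genotype matrix (0/1/2 coding).
import numpy as np

rng = np.random.default_rng(6)
n, m = 100, 12
G = rng.integers(0, 3, size=(n, m)).astype(float)  # 100 samples x 12 SNPs
G[:, 5] = G[:, 4]                                  # SNP 5 in perfect LD with SNP 4

r2 = np.corrcoef(G.T) ** 2                         # pairwise r2 between SNPs

kept = []
for j in range(m):                                 # greedy forward pass
    if all(r2[j, k] < 0.8 for k in kept):
        kept.append(j)

print(kept)
```

Real pipelines (e.g. sliding-window pruning over sorted positions) restrict the r2 comparison to nearby SNPs rather than all pairs, but the keep/drop logic is the same.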

  • Article type: Journal Article
    The imputation of missing values in multivariate time-series data is a basic and popular data processing technology. Recently, some studies have exploited Recurrent Neural Networks (RNNs) and Generative Adversarial Networks (GANs) to impute/fill the missing values in multivariate time-series data. However, when faced with datasets with high missing rates, the imputation error of these methods increases dramatically. To this end, we propose a neural network model based on dynamic contribution and attention, denoted as ContrAttNet. ContrAttNet consists of three novel modules: a feature attention module, an iLSTM (imputation Long Short-Term Memory) module, and a 1D-CNN (1-Dimensional Convolutional Neural Network) module. ContrAttNet exploits temporal information and spatial feature information to predict missing values, where the iLSTM attenuates the memory of the LSTM according to the characteristics of the missing values, to learn the contributions of different features. Moreover, the feature attention module introduces an attention mechanism based on contributions, to calculate supervised weights. Furthermore, under the influence of these supervised weights, the 1D-CNN processes the time-series data by treating them as spatial features. Experimental results show that ContrAttNet outperforms other state-of-the-art models in the missing value imputation of multivariate time-series data, with an average MAPE of 6% and an average MAE of 9% on the benchmark datasets.
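Neural imputers such as the one described above are typically benchmarked against simple baselines. A common one is linear interpolation over the observed time points, sketched here on a toy single-channel series (the data and 30% missing rate are illustrative assumptions, not the paper's benchmarks):

```python
# Linear-interpolation baseline for time-series imputation (toy data).
import numpy as np

rng = np.random.default_rng(4)
t = np.arange(100, dtype=float)
series = np.sin(t / 10.0)                   # one channel of a smooth series
mask = rng.random(100) < 0.3                # 30% missing rate
x = series.copy()
x[mask] = np.nan

obs = ~np.isnan(x)
x_imp = x.copy()
x_imp[~obs] = np.interp(t[~obs], t[obs], x[obs])  # interpolate missing points

mae = np.abs(x_imp[mask] - series[mask]).mean()
print(round(mae, 3))
```

Such baselines work well on smooth, low-missing-rate data; the motivation for RNN/GAN/attention models is precisely the regime where missing rates are high and cross-feature structure must be exploited.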

  • Article type: Journal Article
    BACKGROUND: Partially observed confounder data pose challenges to the statistical analysis of electronic health records (EHR), and systematic assessments of the potentially underlying missingness mechanisms are lacking. We aimed to provide a principled approach to empirically characterize missing data processes and investigate the performance of analytic methods.
    METHODS: Three empirical sub-cohorts of diabetic SGLT2- or DPP4-inhibitor initiators with complete information on HbA1c, BMI, and smoking as confounders of interest (COI) formed the basis of data simulation under a plasmode framework. A true null treatment effect, including the COI in the outcome generation model, and four missingness mechanisms for the COI were simulated: completely at random (MCAR), at random (MAR), and two not-at-random (MNAR) mechanisms, where missingness was dependent on an unmeasured confounder and on the value of the COI itself. We evaluated the ability of three groups of diagnostics to differentiate between mechanisms: 1) differences in characteristics between patients with or without the observed COI (using averaged standardized mean differences [ASMD]), 2) predictive ability of the missingness indicator based on observed covariates, and 3) association of the missingness indicator with the outcome. We then compared analytic methods, including "complete case", inverse probability weighting, and single and multiple imputation, in their ability to recover true treatment effects.
    RESULTS: The diagnostics successfully identified characteristic patterns of the simulated missingness mechanisms. For MAR, but not MCAR, the patient characteristics showed substantial differences (median ASMD 0.20 vs 0.05) and, consequently, discrimination of the prediction models for missingness was also higher (0.59 vs 0.50). For MNAR, but not MAR or MCAR, missingness was significantly associated with the outcome even in models adjusting for other observed covariates. Comparing analytic methods, multiple imputation using a random forest algorithm resulted in the lowest root mean squared error.
    CONCLUSIONS: Principled diagnostics provided reliable insights into missingness mechanisms. When assumptions allow, multiple imputation with nonparametric models could help reduce bias.
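The first diagnostic above, averaged standardized mean differences between patients with and without the observed confounder, can be sketched directly. The simulated covariates, the logistic MAR mechanism, and the pooled-variance SMD definition below are assumptions standing in for the study's plasmode data:

```python
# ASMD diagnostic sketch: compare covariates between patients with
# missing vs observed confounder values (simulated MAR mechanism).
import numpy as np

rng = np.random.default_rng(5)
n = 2000
covs = rng.normal(size=(n, 3))              # three observed covariates
# MAR mechanism: missingness probability depends on the first covariate
p_miss = 1 / (1 + np.exp(-covs[:, 0]))
miss = rng.random(n) < p_miss               # missingness indicator

def smd(x, g):
    """Standardized mean difference of x between groups g and ~g."""
    pooled = np.sqrt((x[g].var() + x[~g].var()) / 2)
    return abs(x[g].mean() - x[~g].mean()) / pooled

smds = [smd(covs[:, j], miss) for j in range(3)]
print(round(float(np.mean(smds)), 2))       # the ASMD across covariates
```

Under MCAR all per-covariate SMDs would hover near zero; the elevated SMD on the driving covariate is exactly the pattern the paper uses to distinguish MAR from MCAR.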
