Imputation

  • Article type: Journal Article
    Ranked set sampling (RSS) is known to increase the efficiency of estimators compared with simple random sampling. The problem of missingness creates a gap in the information that needs to be addressed before proceeding with estimation. A negligible amount of work has been carried out to deal with missingness utilizing RSS. This paper proposes some logarithmic-type methods of imputation for the estimation of the population mean under RSS using auxiliary information. The properties of the suggested imputation procedures are examined. A simulation study is carried out to show that the proposed imputation procedures exhibit better results in comparison to some of the existing imputation procedures. A few real applications of the proposed imputation procedures are also provided to generalize the simulation study.
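    The paper's logarithmic-type estimators are not given in the abstract, so the sketch below illustrates only the general setting: drawing a ranked set sample and imputing missing responses with a simple ratio estimator based on an auxiliary variable. All names, parameters, and the toy population are illustrative assumptions, not taken from the paper.

```python
import random

random.seed(0)

def ranked_set_sample(population, set_size, cycles):
    """Draw a ranked set sample of (x, y) pairs: in each cycle, form
    `set_size` random sets, rank each set by the auxiliary variable x,
    and keep the i-th order statistic from the i-th set."""
    sample = []
    for _ in range(cycles):
        for i in range(set_size):
            ranked = sorted(random.sample(population, set_size))  # sorts by x
            sample.append(ranked[i])
    return sample

def ratio_impute(y_obs, x_obs, x_mis):
    """Impute each missing y with the ratio estimate (ybar/xbar) * x,
    exploiting the auxiliary variable x observed on every unit."""
    r = (sum(y_obs) / len(y_obs)) / (sum(x_obs) / len(x_obs))
    return [r * x for x in x_mis]

# Toy population where y is roughly proportional to the auxiliary x
population = [(x, 2.0 * x + random.gauss(0, 0.5)) for x in range(1, 101)]
rss = ranked_set_sample(population, set_size=3, cycles=10)

# Pretend the last 5 responses are missing; x is known for every unit
y_obs = [y for _, y in rss[:-5]]
x_obs = [x for x, _ in rss[:-5]]
x_mis = [x for x, _ in rss[-5:]]
y_imp = ratio_impute(y_obs, x_obs, x_mis)

mean_est = (sum(y_obs) + sum(y_imp)) / len(rss)
print(round(mean_est, 2))
```

    The imputed values inherit the observed y-to-x ratio, so the resulting mean estimate stays close to the population mean even with the missing responses.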
  • Article type: Journal Article
    BACKGROUND: Racial disparities in dementia outcomes persist in the United States. Targeting modifiable risk factors, including cardiovascular risk factors (CVRFs), is a conceivable way to reduce health disparities. Life course CVRFs are often higher in non-White adults and are associated with risk of dementia, but it is unknown whether they contribute to racial disparities in dementia and cognition.
    METHODS: Using a pooled cohort of 4,159 White and 939 Black participants aged 65-95 years, we conducted a mediation analysis to estimate the proportional effect of race on dementia that is explained by four CVRFs imputed over the life course (20-49, 50-69, and 70-89 years of age): body mass index, fasting glucose, systolic blood pressure, and low-density lipoprotein cholesterol.
    RESULTS: Compared to White participants, Black participants had greater risk of dementia (adjusted OR = 1.37; 95% CI: 1.17-1.60). BMI and fasting glucose over the life course were significant mediators of the effect of race on dementia risk, mediating 39.1% (95% CI: 10.5-67.8%) and 8.2% (95% CI: 0.1-16.2%) of the effect, respectively, adjusted for sex and age. All four CVRFs together were also significant mediators of the effect of race on scores on global cognition and processing speed, accounting for approximately 11% of the effect.
    CONCLUSIONS: We found that CVRFs across the life course partially explain disparities in dementia risk and cognition in late life. Improved prevention and treatment of CVRFs across the life course may be important to reduce health disparities for dementia.
  • Article type: Journal Article
    Cephalosporin antibiotics are widely used in clinical settings, but they can cause hypersensitivity reactions, which may be influenced by genetic factors such as the expression of human leukocyte antigen (HLA) molecules. This study aimed to investigate whether specific HLA alleles were associated with an increased risk of adverse reactions to cephalosporins among individuals in the Taiwanese population. This retrospective case-control study analyzed data from the Taiwan Precision Medicine Initiative (TPMI) on 27,933 individuals who received cephalosporin exposure and had HLA allele genotyping information available. Using logistic regression analyses, we examined the associations between HLA genotypes, comorbidities, allergy risk, and severity. Among the study population, 278 individuals had cephalosporin allergy and 2780 were in the control group. Our results indicated that certain HLA alleles, including HLA-B*55:02 (OR = 1.76, 95% CI 1.18-2.61, p = 0.005), HLA-C*01:02 (OR = 1.36, 95% CI 1.05-1.77, p = 0.018), and HLA-DQB1*06:09 (OR = 2.58, 95% CI 1.62-4.12, p < 0.001), were significantly associated with an increased risk of cephalosporin allergic reactions. Additionally, the HLA-C*01:02 allele genotype was significantly associated with a higher risk of severe allergy (OR = 2.33, 95% CI 1.05-5.15, p = 0.04). This study identified significant associations between HLA alleles and an increased risk of cephalosporin allergy, which can aid in early detection and prediction of adverse drug reactions to cephalosporins. Furthermore, our study highlights the importance of HLA typing in drug safety and expands our knowledge of drug hypersensitivity syndromes.
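    The ORs above come from adjusted logistic regression models; as a simpler, hedged illustration of how an unadjusted odds ratio and Wald 95% CI are obtained from carrier/allergy counts, consider the sketch below. The 2x2 counts are hypothetical and chosen only so the OR lands near the reported effect sizes.

```python
import math

def odds_ratio_ci(a, b, c, d, z=1.96):
    """Unadjusted odds ratio for the 2x2 table
    [[carrier & allergy, carrier & no allergy],
     [non-carrier & allergy, non-carrier & no allergy]]
    with a Wald 95% confidence interval computed on the log scale."""
    or_ = (a * d) / (b * c)
    se = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)  # SE of log(OR)
    lo = math.exp(math.log(or_) - z * se)
    hi = math.exp(math.log(or_) + z * se)
    return or_, lo, hi

# Hypothetical counts for illustration only (not the study's data)
or_, lo, hi = odds_ratio_ci(a=40, b=260, c=238, d=2520)
print(round(or_, 2), round(lo, 2), round(hi, 2))
```

    An interval whose lower bound stays above 1, as here, is what underlies the significance claims quoted in the abstract.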
  • Article type: Journal Article
    Single-cell transcriptomics (scRNA-seq) is revolutionizing biological research, yet it faces challenges such as inefficient transcript capture and noise. To address these challenges, methods like neighbor averaging or graph diffusion are used. These methods often rely on k-nearest neighbor graphs from low-dimensional manifolds. However, scRNA-seq data suffer from the 'curse of dimensionality', leading to the over-smoothing of data when using imputation methods. To overcome this, sc-PHENIX employs a PCA-UMAP diffusion method, which enhances the preservation of data structures and allows for a refined use of PCA dimensions and diffusion parameters (e.g., k-nearest neighbors, exponentiation of the Markov matrix) to minimize noise introduction. This approach enables a more accurate construction of the exponentiated Markov matrix (cell neighborhood graph), surpassing methods like MAGIC. sc-PHENIX significantly mitigates over-smoothing, as validated through various scRNA-seq datasets, demonstrating improved cell phenotype representation. Applied to a multicellular tumor spheroid dataset, sc-PHENIX identified known extreme phenotype states, showcasing its effectiveness. sc-PHENIX is open-source and available for use and modification.
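    sc-PHENIX's PCA-UMAP construction is beyond a short sketch, but the core diffusion step it shares with MAGIC-style methods, building a k-nearest-neighbour graph, row-normalizing it into a Markov matrix, exponentiating that matrix, and multiplying it with the data, can be shown compactly. The toy expression matrix and parameter choices below are illustrative assumptions.

```python
import math

def knn_markov(data, k):
    """Row-stochastic Markov matrix over a k-nearest-neighbour graph
    (uniform weight on each cell's k nearest cells, self included)."""
    n = len(data)
    M = []
    for i in range(n):
        order = sorted(range(n), key=lambda j: math.dist(data[i], data[j]))
        nbrs = set(order[:k])
        M.append([1.0 / k if j in nbrs else 0.0 for j in range(n)])
    return M

def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]

def diffuse(data, k, t):
    """Denoise/impute by applying the exponentiated Markov matrix M^t,
    replacing each cell with an average over its graph neighbourhood."""
    M = knn_markov(data, k)
    Mt = M
    for _ in range(t - 1):
        Mt = matmul(Mt, M)
    return matmul(Mt, data)

# Two tight clusters of cells; cell 1 has a dropout (0.0) in gene 2
cells = [[5.0, 1.0], [5.2, 0.0], [4.8, 1.2],   # cluster A
         [0.5, 9.0], [0.4, 9.2], [0.6, 8.8]]   # cluster B
smoothed = diffuse(cells, k=3, t=2)
print([round(v, 2) for v in smoothed[1]])  # dropout pulled toward cluster mean
```

    The exponent t controls how far information spreads: too small leaves dropouts under-imputed, too large over-smooths distinct clusters together, which is the failure mode the abstract says sc-PHENIX mitigates.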
  • Article type: Journal Article
    Network embedding is a general-purpose machine learning technique that converts network data from non-Euclidean space to Euclidean space, facilitating downstream analyses for the networks. However, existing embedding methods are often optimization-based, with the embedding dimension determined in a heuristic or ad hoc way, which can cause potential bias in downstream statistical inference. Additionally, existing deep embedding methods can suffer from a nonidentifiability issue due to the universal approximation power of deep neural networks. We address these issues within a rigorous statistical framework. We treat the embedding vectors as missing data, reconstruct the network features using a sparse decoder, and simultaneously impute the embedding vectors and train the sparse decoder using an adaptive stochastic gradient Markov chain Monte Carlo (MCMC) algorithm. Under mild conditions, we show that the sparse decoder provides a parsimonious mapping from the embedding space to network features, enabling effective selection of the embedding dimension and overcoming the nonidentifiability issue encountered by existing deep embedding methods. Furthermore, we show that the embedding vectors converge weakly to a desired posterior distribution in the 2-Wasserstein distance, addressing the potential bias issue experienced by existing embedding methods. This work lays down the first theoretical foundation for network embedding within the framework of missing data imputation.
  • Article type: Journal Article
    No abstract available.
  • Article type: Journal Article
    Tensor factorization is a dimensionality reduction method applied to multidimensional arrays. These methods are useful for identifying patterns within a variety of biomedical datasets due to their ability to preserve the organizational structure of experiments and therefore aid in generating meaningful insights. However, missing data in the datasets being analyzed can impose challenges. Tensor factorization can be performed in the presence of some missing data and used to reconstruct a complete tensor. However, while tensor methods may impute these missing values, the choice of fitting algorithm may influence the fidelity of these imputations. Previous approaches, based on alternating least squares with prefilled values or direct optimization, suffer from introduced bias or slow computational performance. In this study, we propose that censored least squares can better handle missing values with data structured in tensor form. We ran censored least squares on four different biological datasets and compared its performance against alternating least squares with prefilled values and direct optimization. We used the error of imputation and the ability to infer masked values to benchmark their missing data performance. Censored least squares appeared best suited for the analysis of high-dimensional biological data by accuracy and convergence metrics across several studies.
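    A minimal sketch of the censored-least-squares idea, reduced from tensors to a rank-1 matrix for brevity: each alternating least-squares update simply sums over observed entries only, rather than prefilling missing values with a guess. The toy data and iteration count are illustrative assumptions, not the paper's implementation.

```python
def censored_rank1(X, iters=50):
    """Rank-1 factorization X ≈ outer(u, v) where missing entries (None)
    are excluded ("censored") from every least-squares update instead of
    being prefilled."""
    m, n = len(X), len(X[0])
    u = [1.0] * m
    v = [1.0] * n
    for _ in range(iters):
        for i in range(m):  # update u with v fixed, observed entries only
            num = sum(X[i][j] * v[j] for j in range(n) if X[i][j] is not None)
            den = sum(v[j] ** 2 for j in range(n) if X[i][j] is not None)
            u[i] = num / den
        for j in range(n):  # update v with u fixed, observed entries only
            num = sum(X[i][j] * u[i] for i in range(m) if X[i][j] is not None)
            den = sum(u[i] ** 2 for i in range(m) if X[i][j] is not None)
            v[j] = num / den
    return u, v

# Rank-1 ground truth X[i][j] = a[i] * b[j] with two entries censored
a, b = [1.0, 2.0, 3.0], [2.0, 4.0, 6.0, 8.0]
X = [[ai * bj for bj in b] for ai in a]
X[0][3] = None
X[2][1] = None

u, v = censored_rank1(X)
imputed = u[0] * v[3]  # reconstruct a censored entry (true value 8.0)
print(round(imputed, 2))
```

    Because the censored entries never enter the normal equations, the fit is driven entirely by observed data, avoiding the bias that a prefilled placeholder value would inject.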
  • Article type: Journal Article
    Kidney stone disease is a widespread urological disorder affecting millions globally. Timely diagnosis is crucial to avoid severe complications. Traditionally, renal stones are detected using computed tomography (CT), which, despite its effectiveness, is costly, resource-intensive, exposes patients to unnecessary radiation, and often results in delays due to radiology report wait times. This study presents a novel approach leveraging machine learning to detect renal stones early using routine laboratory test results. We utilized an extensive dataset comprising 2156 patient records from a Saudi Arabian hospital, featuring 15 attributes with challenges such as missing data and class imbalance. We evaluated various machine learning algorithms and imputation methods, including single and multiple imputations, as well as oversampling and undersampling techniques. Our results demonstrate that ensemble tree-based classifiers, specifically random forest (RF) and extra tree classifiers (ETree), outperform others with remarkable accuracy rates of 99%, recall rates of 98%, and F1 scores of 99% for RF, and 92% for ETree. This study underscores the potential of non-invasive, cost-effective laboratory tests for renal stone detection, promoting prompt and improved medical support.
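    The study's classifiers are standard library implementations; as a hedged sketch of one preprocessing step it evaluates, class rebalancing, the snippet below implements plain random oversampling of the minority class (toy data, not the hospital dataset).

```python
import random

random.seed(42)

def random_oversample(rows, labels):
    """Balance a binary dataset by resampling the minority class with
    replacement until both classes have equal counts."""
    by_class = {}
    for row, lab in zip(rows, labels):
        by_class.setdefault(lab, []).append(row)
    target = max(len(v) for v in by_class.values())
    out_rows, out_labels = [], []
    for lab, members in by_class.items():
        extra = [random.choice(members) for _ in range(target - len(members))]
        for row in members + extra:
            out_rows.append(row)
            out_labels.append(lab)
    return out_rows, out_labels

# Imbalanced toy data: 6 negatives, 2 positives
rows = [[i] for i in range(8)]
labels = [0, 0, 0, 0, 0, 0, 1, 1]
bal_rows, bal_labels = random_oversample(rows, labels)
print(bal_labels.count(0), bal_labels.count(1))
```

    In practice the rebalanced data would then be fed to the classifier; oversampling should be applied inside the training folds only, or the duplicated minority rows leak into the test set and inflate the reported accuracy.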
  • Article type: Journal Article
    Current methodologies of genome-wide single-nucleotide polymorphism (SNP) genotyping produce large amounts of missing data that may affect statistical inference and bias the outcome of experiments. Genotype imputation is routinely used in well-studied species to buffer the impact in downstream analysis, and several algorithms are available to fill in missing genotypes. The lack of reference haplotype panels precludes the use of these methods in genomic studies on non-model organisms. As an alternative, machine learning algorithms are employed to explore the genotype data and to estimate the missing genotypes. Here, we propose an imputation method based on self-organizing maps (SOM), a widely used type of neural network formed by spatially distributed neurons that cluster similar inputs onto nearby neurons. The method explores genotype datasets to select SNP loci to build binary vectors from the genotypes, and initializes and trains neural networks for each query missing SNP genotype. The SOM-derived clustering is then used to impute the best genotype. To automate the imputation process, we have implemented gtImputation, an open-source application programmed in Python3 with a user-friendly GUI to facilitate the whole process. The method's performance was validated by comparing its accuracy, precision, and sensitivity on several benchmark genotype datasets with those of other available imputation algorithms. Our approach produced highly accurate and precise genotype imputations even for SNPs with alleles at low frequency and outperformed other algorithms, especially for datasets from mixed populations with unrelated individuals.
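    gtImputation itself is not reproduced here; the sketch below shows the general SOM-imputation idea under stated assumptions: train a small 1-D SOM on genotype-like vectors, match a query with a missing value to its best-matching unit using observed components only, and fill the gap from that unit's weights. The grid size, learning schedule, and toy genotypes are all illustrative.

```python
import random

random.seed(1)

def train_som(data, n_units=4, epochs=200, lr0=0.5, radius0=1.0):
    """Train a 1-D self-organizing map: pull the best-matching unit (BMU)
    and its grid neighbours toward each input, with decaying learning rate
    and neighbourhood radius."""
    dim = len(data[0])
    units = [[random.random() for _ in range(dim)] for _ in range(n_units)]
    for e in range(epochs):
        lr = lr0 * (1 - e / epochs)
        radius = radius0 * (1 - e / epochs)
        for x in data:
            bmu = min(range(n_units),
                      key=lambda u: sum((units[u][d] - x[d]) ** 2
                                        for d in range(dim)))
            for u in range(n_units):
                if abs(u - bmu) <= radius:
                    for d in range(dim):
                        units[u][d] += lr * (x[d] - units[u][d])
    return units

def som_impute(units, x):
    """Match x to its BMU using observed components only (None = missing)
    and fill each missing component from the BMU's weight vector."""
    obs = [d for d in range(len(x)) if x[d] is not None]
    bmu = min(units, key=lambda w: sum((w[d] - x[d]) ** 2 for d in obs))
    return [x[d] if x[d] is not None else round(bmu[d]) for d in range(len(x))]

# Toy biallelic genotypes coded 0/1/2; two clear clusters of individuals
data = [[0, 0, 1, 0], [0, 1, 1, 0], [0, 0, 1, 1],
        [2, 2, 1, 2], [2, 1, 1, 2], [2, 2, 1, 1]]
units = train_som(data)
print(som_impute(units, [2, 2, 1, None]))
```

    Rounding the BMU weight back to a 0/1/2 code stands in for gtImputation's cluster-based genotype call; the real method builds binary vectors per SNP locus rather than using raw codes.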
  • Article type: Journal Article
    The landscape of survival analysis is constantly being revolutionized to answer biomedical challenges, most recently the statistical challenge of censored covariates rather than outcomes. There are many promising strategies to tackle censored covariates, including weighting, imputation, maximum likelihood, and Bayesian methods. Still, this is a relatively fresh area of research, different from the areas of censored outcomes (i.e., survival analysis) or missing covariates. In this review, we discuss the unique statistical challenges encountered when handling censored covariates and provide an in-depth review of existing methods designed to address those challenges. We emphasize each method's relative strengths and weaknesses, providing recommendations to help investigators pinpoint the best approach to handling censored covariates in their data.
