Imputation

  • Article type: Journal Article
    Low-coverage whole-genome sequencing (LCS) offers a cost-effective alternative for sturgeon breeding, especially given the lack of SNP chips and the high costs associated with whole-genome sequencing. In this study, the efficiency of LCS for genotype imputation and genomic prediction was assessed in 643 sequenced Russian sturgeons (∼13.68×). The results showed that using BaseVar+STITCH at a sequencing depth of 2× with a sample size larger than 300 resulted in the highest genotyping accuracy. In addition, when the sequencing depth reached 0.5× and SNP density was reduced to 50 K through linkage disequilibrium pruning, the prediction accuracy was comparable to that of whole sequencing depth. Furthermore, an incremental feature selection method has the potential to improve prediction accuracy. This study suggests that the combination of LCS and imputation can be a cost-effective strategy, contributing to the genetic improvement of economic traits and promoting genetic gains in aquaculture species.
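
The linkage disequilibrium pruning step mentioned above can be illustrated with a minimal sketch. It assumes genotypes coded as 0/1/2 dosages, squared Pearson correlation as the LD measure, and a greedy scan over a sliding window; the `threshold` and `window` values and the toy data are assumptions for illustration, not settings reported in the study.

```python
from statistics import mean

def r_squared(a, b):
    """Squared Pearson correlation between two genotype vectors (0/1/2 dosages)."""
    ma, mb = mean(a), mean(b)
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    if va == 0 or vb == 0:  # monomorphic SNP: correlation is undefined
        return 0.0
    return cov * cov / (va * vb)

def ld_prune(genotypes, threshold=0.5, window=50):
    """Greedy pruning: keep a SNP only if its r^2 with each of the
    last `window` kept SNPs is below `threshold` (both values assumed)."""
    kept = []
    for idx, snp in enumerate(genotypes):
        recent = kept[-window:]
        if all(r_squared(snp, genotypes[k]) < threshold for k in recent):
            kept.append(idx)
    return kept

# Toy data: rows are SNPs, columns are samples (hypothetical values)
snps = [
    [0, 1, 2, 0, 1, 2],
    [0, 1, 2, 0, 1, 2],  # identical to SNP 0
    [2, 0, 1, 1, 2, 0],  # weakly correlated with SNP 0
    [2, 1, 0, 2, 1, 0],  # perfectly anti-correlated with SNP 0
]
kept = ld_prune(snps)
print(kept)
```

Lowering `threshold` removes more SNPs, which is one way a marker set could be thinned toward a target density such as 50 K.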

  • Article type: Journal Article
    The imputation of missing values in multivariate time-series data is a basic and popular data processing technology. Recently, some studies have exploited Recurrent Neural Networks (RNNs) and Generative Adversarial Networks (GANs) to impute/fill the missing values in multivariate time-series data. However, when faced with datasets with high missing rates, the imputation error of these methods increases dramatically. To this end, we propose a neural network model based on dynamic contribution and attention, denoted as ContrAttNet. ContrAttNet consists of three novel modules: feature attention module, iLSTM (imputation Long Short-Term Memory) module, and 1D-CNN (1-Dimensional Convolutional Neural Network) module. ContrAttNet exploits temporal information and spatial feature information to predict missing values, where iLSTM attenuates the memory of LSTM according to the characteristics of the missing values, to learn the contributions of different features. Moreover, the feature attention module introduces an attention mechanism based on contributions, to calculate supervised weights. Furthermore, under the influence of these supervised weights, 1D-CNN processes the time-series data by treating them as spatial features. Experimental results show that ContrAttNet outperforms other state-of-the-art models in the missing value imputation of multivariate time-series data, with average 6% MAPE and 9% MAE on the benchmark datasets.
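
The two error measures quoted above, MAE and MAPE, are usually computed only over the entries that were hidden from the model. The sketch below follows that assumption; the function names, the mask convention (`True` means the value was held out), and the toy numbers are illustrative and do not come from the paper.

```python
def mae(truth, imputed, mask):
    """Mean absolute error over held-out entries (mask == True)."""
    errors = [abs(t - p) for t, p, m in zip(truth, imputed, mask) if m]
    return sum(errors) / len(errors)

def mape(truth, imputed, mask):
    """Mean absolute percentage error over held-out entries.
    Entries whose true value is zero are skipped to avoid division by zero."""
    ratios = [abs((t - p) / t)
              for t, p, m in zip(truth, imputed, mask) if m and t != 0]
    return 100.0 * sum(ratios) / len(ratios)

# Hypothetical series: positions 1 and 2 were masked and then imputed
truth   = [10.0, 20.0, 40.0, 50.0]
imputed = [10.0, 22.0, 36.0, 50.0]
mask    = [False, True, True, False]

print(mae(truth, imputed, mask), mape(truth, imputed, mask))
```

Because MAPE divides by the true value, it is unstable for series with values near zero, which is one reason both metrics are commonly reported together.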

  • Article type: Journal Article
    Due to various reasons, such as limitations in data collection and interruptions in network transmission, gathered data often contain missing values. Existing state-of-the-art generative adversarial imputation methods face three main issues: limited applicability, neglect of latent categorical information that could reflect relationships among samples, and an inability to balance local and global information. We propose a novel generative adversarial model named DTAE-CGAN that incorporates detracking autoencoding and conditional labels to address these issues. This enhances the network's ability to learn inter-sample correlations and makes full use of all data information in incomplete datasets, rather than learning random noise. We conducted experiments on six real datasets of varying sizes, comparing our method with four classic imputation baselines. The results demonstrate that our proposed model consistently exhibited superior imputation accuracy.

  • Article type: Journal Article
    Ancient genomic analyses are often restricted to utilizing pseudohaploid data due to low genome coverage. Leveraging low-coverage data by imputation to calculate phased diploid genotypes that enables haplotype-based interrogation and single nucleotide polymorphism (SNP) calling at unsequenced positions is highly desirable. This has not been investigated for ancient cattle genomes despite these being compelling subjects for archeological, evolutionary, and economic reasons. Here, we test this approach by sequencing a Mesolithic European aurochs (18.49×; 9,852 to 9,376 calBCE) and an Early Medieval European cow (18.69×; 427 to 580 calCE) and combine these with published individuals: two ancient and three modern. We downsample these genomes (0.25×, 0.5×, 1.0×, and 2.0×) and impute diploid genotypes, utilizing a reference panel of 171 published modern cattle genomes that we curated for 21.7 million (Mn) phased SNPs. We recover high densities of correct calls with an accuracy of >99.1% at variant sites for the lowest downsample depth of 0.25×, increasing to >99.5% for 2.0× (transversions only, minor allele frequency [MAF] ≥ 2.5%). The recovery of SNPs correlates with coverage; on average, 58% of sites are recovered for 0.25× increasing to 87% for 2.0×, utilizing an average of 3.5 million (Mn) transversions (MAF ≥2.5%), even in the aurochs, despite the highest temporal distance from the modern reference panel. Our imputed genomes behave similarly to directly called data in allele frequency-based analyses, for example consistently identifying runs of homozygosity >2 Mb, including a long homozygous region in the Mesolithic European aurochs.
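
The abstract above reports two quantities per downsampling depth: the share of sites recovered and the accuracy of the recovered calls. A minimal sketch of how such figures could be computed follows, assuming genotypes coded 0/1/2 and `None` for a site where imputation produced no confident call. These definitions are assumptions, since the abstract does not spell them out, and the toy genotypes are invented.

```python
def evaluate_imputation(truth, imputed):
    """Return (recovery, accuracy).
    recovery: fraction of sites that received a call.
    accuracy: fraction of called sites matching the high-coverage genotype."""
    called = [(t, g) for t, g in zip(truth, imputed) if g is not None]
    recovery = len(called) / len(truth)
    accuracy = sum(t == g for t, g in called) / len(called)
    return recovery, accuracy

# Hypothetical genotypes: truth from the full-depth genome,
# imputed from a downsampled copy of the same genome
truth   = [0, 1, 2, 1, 0, 2, 1, 0]
imputed = [0, 1, 2, None, 0, 1, None, 0]

recovery, accuracy = evaluate_imputation(truth, imputed)
print(recovery, accuracy)
```

Reporting both numbers matters because a stricter call filter raises accuracy while lowering recovery.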

  • Article type: Journal Article
    Spatial transcriptomics (ST) has become a powerful tool for exploring the spatial organization of gene expression in tissues. Imaging-based methods, though offering superior spatial resolutions at the single-cell level, are limited in either the number of imaged genes or the sensitivity of gene detection. Existing approaches for enhancing ST rely on the similarity between ST cells and reference single-cell RNA sequencing (scRNA-seq) cells. In contrast, we introduce stDiff, which leverages relationships between gene expression abundance in scRNA-seq data to enhance ST. stDiff employs a conditional diffusion model, capturing gene expression abundance relationships in scRNA-seq data through two Markov processes: one introducing noise to transcriptomics data and the other denoising to recover them. The missing portion of ST is predicted by incorporating the original ST data into the denoising process. In our comprehensive performance evaluation across 16 datasets, utilizing multiple clustering and similarity metrics, stDiff stands out for its exceptional ability to preserve topological structures among cells, positioning itself as a robust solution for cell population identification. Moreover, stDiff's enhancement outcomes closely mirror the actual ST data within the batch space. Across diverse spatial expression patterns, our model accurately reconstructs them, delineating distinct spatial boundaries. This highlights stDiff's capability to unify the observed and predicted segments of ST data for subsequent analysis. We anticipate that stDiff, with its innovative approach, will contribute to advancing ST imputation methodologies.
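
The noising Markov process described above can be sketched under the standard denoising-diffusion formulation, in which a sample at step t is drawn in closed form as sqrt(a_t) * x_0 + sqrt(1 - a_t) * noise, with a_t the cumulative product of (1 - beta). The linear beta schedule, the step count, and the expression values below are assumptions for illustration. The abstract does not give stDiff's schedule, and the learned conditional denoiser is not shown.

```python
import math
import random

def alpha_bar_schedule(n_steps=100, beta_start=1e-4, beta_end=0.05):
    """Cumulative products of (1 - beta_t) for a linear beta schedule (assumed)."""
    alpha_bars, prod = [], 1.0
    for t in range(n_steps):
        beta = beta_start + (beta_end - beta_start) * t / (n_steps - 1)
        prod *= 1.0 - beta
        alpha_bars.append(prod)
    return alpha_bars

def noise_to_step(x0, alpha_bar_t, rng):
    """Sample x_t given x_0: sqrt(a) * x0 + sqrt(1 - a) * Gaussian noise."""
    return [math.sqrt(alpha_bar_t) * x
            + math.sqrt(1.0 - alpha_bar_t) * rng.gauss(0.0, 1.0)
            for x in x0]

rng = random.Random(0)
expression = [0.0, 1.2, 3.4, 0.5]  # hypothetical normalized expression values
alpha_bars = alpha_bar_schedule()

slightly_noisy = noise_to_step(expression, alpha_bars[0], rng)   # close to the input
mostly_noise = noise_to_step(expression, alpha_bars[-1], rng)    # signal largely gone
print(alpha_bars[0], alpha_bars[-1])
```

The denoising process runs in the opposite direction with a trained network; in a conditional setup, the observed genes would be supplied to that network at every step.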

  • Article type: Journal Article
    Single-cell RNA sequencing (scRNA-seq) facilitates the study of cell type heterogeneity and the construction of cell atlases. However, due to its limitations, many genes may be detected with zero expression, i.e. dropout events, leading to bias in downstream analyses and hindering the identification and characterization of cell types and cell functions. Although many imputation methods have been developed, their performance is generally lower than expected across different kinds and dimensions of data and application scenarios. Therefore, developing an accurate and robust single-cell gene expression data imputation method is still essential. To maintain the original cell-cell and gene-gene correlations and leverage information from bulk RNA sequencing (bulk RNA-seq) data, we propose scINRB, a single-cell gene expression imputation method with network regularization and bulk RNA-seq data. scINRB adopts network-regularized non-negative matrix factorization to ensure that the imputed data maintain the cell-cell and gene-gene similarities and also approach the average gene expression calculated from bulk RNA-seq data. To evaluate the performance, we test scINRB on simulated and experimental datasets and compare it with other commonly used imputation methods. The results show that scINRB recovers gene expression accurately even in the case of high dropout rates and dimensions, preserves cell-cell and gene-gene similarities and improves various downstream analyses including visualization, clustering and trajectory inference.

  • Article type: Journal Article
    With the advancement of engineering techniques, underground shield tunneling projects have started incorporating emerging technologies to monitor the forces and displacements during the construction and operation phases of shield tunnels. Monitoring devices installed on the tunnel segment components generate a large amount of data. However, due to various factors, data may be missing. Hence, completing the incomplete data is imperative to ensure the utmost safety of the engineering project. In this research, a missing data imputation technique utilizing Random Forest (RF) is introduced. The optimal combination of the number of decision trees, maximum depth, and number of features in the RF is determined by minimizing the Mean Squared Error (MSE). Subsequently, complete soil pressure data are artificially manipulated to create incomplete datasets with missing rates of 20%, 40%, and 60%. A comparative analysis of the imputation results using three methods (median, mean, and RF) reveals that the proposed RF method has the smallest imputation error. As the missing rate increases, the mean squared error of the Random Forest method and the other two methods also increases, with a maximum difference of about 70%. This indicates that the Random Forest method is suitable for imputing monitoring data.
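
The evaluation protocol described above (remove values at a fixed missing rate, impute, score by MSE on the removed entries) can be sketched as follows. Only the mean and median baselines are implemented here. A Random Forest imputer, for example one built on scikit-learn's `RandomForestRegressor` with its `n_estimators`, `max_depth`, and `max_features` parameters, would be scored through the same `mse_on_missing` function. The pressure readings and all function names are hypothetical.

```python
import random
from statistics import mean, median

def mask_values(values, missing_rate, seed=0):
    """Return a copy of `values` with a fraction `missing_rate` set to None."""
    rng = random.Random(seed)
    n_missing = round(len(values) * missing_rate)
    missing_idx = set(rng.sample(range(len(values)), n_missing))
    return [None if i in missing_idx else v for i, v in enumerate(values)]

def impute_constant(masked, statistic):
    """Fill gaps with a statistic (mean or median) of the observed values."""
    fill = statistic([v for v in masked if v is not None])
    return [fill if v is None else v for v in masked]

def mse_on_missing(truth, masked, imputed):
    """Mean squared error over the artificially removed entries only."""
    errs = [(t - p) ** 2 for t, m, p in zip(truth, masked, imputed) if m is None]
    return sum(errs) / len(errs)

# Hypothetical soil-pressure readings (kPa); not data from the study
pressure = [101.2, 103.5, 99.8, 110.4, 108.9, 97.3, 105.0, 102.2, 111.7, 100.1]

results = {}
for rate in (0.2, 0.4, 0.6):
    masked = mask_values(pressure, rate)
    for name, stat in (("mean", mean), ("median", median)):
        imputed = impute_constant(masked, stat)
        results[(rate, name)] = mse_on_missing(pressure, masked, imputed)

for key, value in sorted(results.items()):
    print(key, round(value, 3))
```

Scoring only the removed entries keeps the comparison fair, since every method reproduces the observed entries exactly.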

  • Article type: Journal Article
    OBJECTIVE: Single-cell RNA sequencing (scRNA-seq) provides a powerful tool for exploring cellular heterogeneity, discovering novel or rare cell types, distinguishing tissue-specific cellular compositions, and understanding cell differentiation during development. However, due to technological limitations, dropout events in scRNA-seq can mistakenly convert some entries in the real data to zero. This is equivalent to introducing noise into the gene expression matrix. The contaminated data degrade the performance of downstream analyses, including clustering, cell annotation, and differential gene expression analysis. Therefore, it is crucial to accurately determine which zeros are due to dropout events and to impute them.
    METHODS: Considering the different confidence levels of different zeros in the gene expression matrix, this paper proposes SinCWIm, a method for imputing dropout events in scRNA-seq based on weighted alternating least squares (WALS). The method utilizes the Pearson correlation coefficient and hierarchical clustering to quantify the confidence of zero entries. It is then combined with WALS for matrix decomposition. The imputation result is then brought closer to the actual values through outlier removal and data correction.
    RESULTS: A total of eight single-cell sequencing datasets were used for comparative experiments to demonstrate the overall superiority of SinCWIm over state-of-the-art models. Clustering on data imputed by SinCWIm was evaluated with the adjusted Rand index, and the Usoskin, Pollen and Bladder datasets scored 94.46%, 96.48% and 76.74%, respectively. In addition, significant improvements were made in the retention of differentially expressed genes and in visualization.
    CONCLUSIONS: SinCWIm provides a valuable imputation method for handling dropout events in single-cell sequencing data. In comparison to advanced methods, SinCWIm demonstrates excellent performance in clustering, visualization and other aspects. It is applicable to various single-cell sequencing datasets.
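
The weighted alternating least squares idea named in the methods above can be sketched in its simplest form, a rank-1 factorization in which each matrix entry has its own weight. This is a generic illustration, not SinCWIm itself: the weights here are set by hand to 0 or 1, whereas the paper derives confidence levels from Pearson correlation and hierarchical clustering, and the rank, the ridge term `lam`, and the toy matrix are assumptions.

```python
def wals_rank1(X, W, n_iter=50, lam=1e-6):
    """Minimize sum_ij W[i][j] * (X[i][j] - u[i] * v[j])**2 (+ small ridge)
    by alternating closed-form updates of u and v."""
    n_rows, n_cols = len(X), len(X[0])
    u = [1.0] * n_rows
    v = [1.0] * n_cols
    for _ in range(n_iter):
        for i in range(n_rows):  # update row factors with v fixed
            num = sum(W[i][j] * X[i][j] * v[j] for j in range(n_cols))
            den = sum(W[i][j] * v[j] ** 2 for j in range(n_cols)) + lam
            u[i] = num / den
        for j in range(n_cols):  # update column factors with u fixed
            num = sum(W[i][j] * X[i][j] * u[i] for i in range(n_rows))
            den = sum(W[i][j] * u[i] ** 2 for i in range(n_rows)) + lam
            v[j] = num / den
    return u, v

# Toy expression matrix built as the outer product of [1, 2, 3] and [2, 4, 6];
# the 0.0 in row 1 stands in for a dropout whose true value is 12
X = [[2.0, 4.0, 6.0],
     [4.0, 8.0, 0.0],
     [6.0, 12.0, 18.0]]
# Weight 0 marks the zero as a suspected dropout, so it does not affect the fit
W = [[1.0, 1.0, 1.0],
     [1.0, 1.0, 0.0],
     [1.0, 1.0, 1.0]]

u, v = wals_rank1(X, W)
imputed = u[1] * v[2]
print(round(imputed, 3))
```

A weight between 0 and 1 would express partial confidence that a zero is genuine, which is the role the confidence levels play in the method described above.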

  • Article type: Journal Article
    Single-cell RNA sequencing (scRNA-seq) is a robust method for studying gene expression at the single-cell level, but accurately quantifying genetic material is often hindered by limited mRNA capture, resulting in many missing expression values. Existing imputation methods rely on strict data assumptions, limiting their broader application, and lack reliable supervision, leading to biased signal recovery. To address these challenges, the authors developed Bis, a distribution-agnostic deep learning model for accurately recovering missing single-cell gene expression from multiple platforms. Bis is an optimal transport-based autoencoder model that can capture the intricate distribution of scRNA-seq data while addressing the characteristic sparsity by regularizing the cellular embedding space. Additionally, they propose a module using bulk RNA-seq data to guide reconstruction and ensure expression consistency. Experimental results show Bis outperforms other models across simulated and real datasets, showcasing superiority in various downstream analyses including batch effect removal, clustering, differential expression analysis, and trajectory inference. Moreover, Bis successfully restores gene expression levels in rare cell subsets in a tumor-matched peripheral blood dataset, revealing developmental characteristics of cytokine-induced natural killer cells within a head and neck squamous cell carcinoma microenvironment.

  • Article type: Journal Article
    BACKGROUND: The emergence of high-resolved spatial transcriptomics (ST) has facilitated the research of novel methods to investigate biological development, organism growth, and other complex biological processes. However, high-resolved and whole transcriptomics ST datasets require customized imputation methods to improve the signal-to-noise ratio and the data quality.
    RESULTS: We propose an efficient and adaptive Gaussian smoothing (EAGS) imputation method for high-resolved ST. The adaptive 2-factor smoothing of EAGS creates patterns based on the spatial and expression information of the cells, creates adaptive weights for the smoothing of cells in the same pattern, and then utilizes the weights to restore the gene expression profiles. We assessed the performance and efficiency of EAGS using simulated and high-resolved ST datasets of mouse brain and olfactory bulb.
    CONCLUSIONS: Compared with other competitive methods, EAGS shows higher clustering accuracy, better biological interpretations, and significantly reduced computational consumption.
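
Gaussian smoothing of expression values over spatial neighbors, the building block named above, can be sketched as follows. This is plain kernel smoothing with a fixed bandwidth and spatial distance only. The adaptive 2-factor weighting of EAGS, which also uses expression information to group cells into patterns, is not reproduced, and the coordinates, values, and `bandwidth` are assumptions for illustration.

```python
import math

def gaussian_smooth(coords, expression, bandwidth=1.0):
    """Replace each cell's value by a Gaussian-weighted average over all cells,
    with weights decaying with squared spatial distance."""
    smoothed = []
    for xi, yi in coords:
        weights = [
            math.exp(-((xi - xj) ** 2 + (yi - yj) ** 2) / (2 * bandwidth ** 2))
            for xj, yj in coords
        ]
        total = sum(weights)
        smoothed.append(sum(w * e for w, e in zip(weights, expression)) / total)
    return smoothed

# Hypothetical cells: three close neighbors and one distant cell
coords = [(0, 0), (0, 1), (1, 0), (10, 10)]
expression = [5.0, 0.0, 4.0, 9.0]  # the 0.0 stands in for a suspected dropout

smoothed = gaussian_smooth(coords, expression)
print([round(s, 3) for s in smoothed])
```

The zero at the second cell is pulled toward its neighbors' values, while the distant cell is left essentially unchanged. Restricting the average to cells in the same pattern, as the method above describes, would limit this mixing across tissue boundaries.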