Keywords: Clustering; Dimensionality reduction; Machine learning; Persistent Laplacian; Persistent homology; Topology; scRNA-seq

MeSH: Single-Cell Analysis / methods; Principal Component Analysis; Humans; Sequence Analysis, RNA / methods; Algorithms; RNA-Seq / methods

Source: DOI:10.1016/j.compbiomed.2024.108497

Abstract:
Single-cell RNA sequencing (scRNA-seq) is widely used to reveal cellular heterogeneity, yielding insights into cell-cell communication, cell differentiation, and differential gene expression. However, analyzing scRNA-seq data is challenging because the data are sparse and involve a large number of genes. Dimensionality reduction and feature selection are therefore important for removing spurious signals and enhancing downstream analysis. Traditional PCA, a workhorse of dimensionality reduction, cannot capture the geometric structure embedded in the data, and previous graph Laplacian regularizations are limited to the analysis of a single scale. We propose a topological Principal Components Analysis (tPCA) method that combines the persistent Laplacian (PL) technique with L2,1-norm regularization to address multiscale and multiclass heterogeneity in data. We further introduce a k-Nearest-Neighbor (kNN) persistent Laplacian technique to improve the robustness of our persistent Laplacian method. The proposed kNN-PL is a new algebraic topology technique that addresses many limitations of traditional persistent homology. Rather than inducing a filtration by varying a distance threshold, kNN-tPCA builds the filtration by varying the number of neighbors in a kNN network at each step, and we find that this framework has significant implications for hyper-parameter tuning. We validate tPCA and kNN-tPCA on 11 diverse benchmark scRNA-seq datasets and show that they outperform other unsupervised PCA enhancements from the literature, as well as the popular Uniform Manifold Approximation and Projection (UMAP), t-Distributed Stochastic Neighbor Embedding (tSNE), and Projection Non-Negative Matrix Factorization (NMF), by significant margins. For example, tPCA provides up to 628%, 78%, and 149% improvements over UMAP, tSNE, and NMF, respectively, on classification in the F1 metric, and kNN-tPCA offers 53%, 63%, and 32% improvements over UMAP, tSNE, and NMF, respectively, on clustering in the ARI metric.
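The kNN filtration idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the 0/1 edge weights, the symmetrization rule, and the choice to combine scales by a weighted sum of graph Laplacians are all assumptions made for illustration, and the abstract does not specify any of them. The sketch builds a nested family of kNN graphs by increasing k, forms the combinatorial Laplacian at each step, and accumulates them into one multiscale matrix that could serve as a regularizer. A small helper for the L2,1 norm (the sum of the Euclidean norms of the rows of a matrix) is included.

```python
import numpy as np


def knn_adjacency(X, k):
    """Symmetric 0/1 adjacency of the k-nearest-neighbor graph on the rows of X."""
    n = X.shape[0]
    # Pairwise squared Euclidean distances between rows.
    sq = np.sum(X ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * (X @ X.T)
    np.fill_diagonal(d2, np.inf)  # a point is never its own neighbor
    # Stable sort so that the k nearest are always a prefix of the k+1 nearest.
    idx = np.argsort(d2, axis=1, kind="stable")[:, :k]
    A = np.zeros((n, n))
    A[np.repeat(np.arange(n), k), idx.ravel()] = 1.0
    # Keep an edge if either endpoint lists the other as a neighbor.
    return np.maximum(A, A.T)


def graph_laplacian(A):
    """Combinatorial graph Laplacian L = D - A."""
    return np.diag(A.sum(axis=1)) - A


def knn_filtration_laplacian(X, ks, weights=None):
    """Weighted sum of kNN-graph Laplacians over an increasing list of k.

    The weighted sum is a hypothetical way to combine scales, chosen here
    only for illustration.
    """
    if weights is None:
        weights = np.ones(len(ks))
    return sum(w * graph_laplacian(knn_adjacency(X, k))
               for w, k in zip(weights, ks))


def l21_norm(M):
    """L2,1 norm: sum over rows of each row's Euclidean norm."""
    return float(np.sum(np.linalg.norm(M, axis=1)))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.normal(size=(30, 5))          # 30 synthetic "cells", 5 features
    L = knn_filtration_laplacian(X, ks=[2, 4, 6])
    print(L.shape, l21_norm(X))
```

Because each kNN graph contains the previous one, the sequence of graphs forms a filtration indexed by the neighbor count rather than by a distance threshold, which is the contrast the abstract draws with distance-based filtrations.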