Keywords: adaptive adjustment; gene expression data; gravitational search algorithm; label confidence; self-training; subspace clustering

Source: DOI:10.3389/fgene.2023.1132370

Abstract:
Gene clustering is an important technique for identifying co-expressed gene groups from gene expression data, and it provides a powerful tool for investigating the functional relationships of genes in biological processes. Self-training is an important semi-supervised learning method that has performed well on the gene clustering problem. However, the self-training process inevitably suffers from mislabeling, and the accumulation of mislabeled samples degrades semi-supervised learning performance on gene expression data. To solve this problem, this paper proposes a self-training subspace clustering algorithm based on adaptive confidence for gene expression data (SSCAC), which combines a low-rank representation of gene expression data with adaptive adjustment of label confidence to better guide the partition of unlabeled data. The advantages of the proposed SSCAC algorithm are mainly reflected in three aspects. 1) To improve the discriminative property of gene expression data, a low-rank representation with a distance penalty is used to mine the potential subspace structure of the data. 2) To address mislabeling in self-training, a semi-supervised clustering objective function with label confidence is proposed, and a self-training subspace clustering framework is built on it. 3) To mitigate the negative impact of mislabeled data, an adaptive adjustment strategy for label confidence based on the gravitational search algorithm is proposed. In extensive experiments on two benchmark gene expression datasets, SSCAC is compared with a variety of state-of-the-art unsupervised and semi-supervised learning algorithms and demonstrates superior performance.
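The general idea of self-training with per-sample label confidence can be illustrated with a minimal sketch. This is not the SSCAC algorithm: it omits the low-rank representation and the gravitational search step, and substitutes a plain confidence-weighted nearest-centroid rule with a softmax over squared distances. The function name `self_train_with_confidence`, the `threshold` parameter, and the toy data are all assumptions made for illustration.

```python
import numpy as np


def self_train_with_confidence(X, y, n_classes, n_rounds=10, threshold=0.6):
    """Generic self-training with confidence-weighted class centroids.

    X: (n_samples, n_features) array of expression profiles.
    y: (n_samples,) ints, class index for labeled samples, -1 for unlabeled.
    Assumes every class has at least one labeled (seed) sample.
    Returns (labels, confidence).
    """
    X = np.asarray(X, dtype=float)
    labels = np.asarray(y).copy()
    seed = labels >= 0                      # true labels, never overwritten
    conf = np.where(seed, 1.0, 0.0)         # seeds get full confidence

    for _ in range(n_rounds):
        # Each centroid is the confidence-weighted mean of its members,
        # so low-confidence pseudo-labels pull the centroid less.
        centroids = np.stack([
            np.average(X[labels == k], axis=0, weights=conf[labels == k])
            for k in range(n_classes)
        ])

        # Softmax over negative squared distances gives a soft assignment.
        d2 = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        logits = -d2
        logits -= logits.max(axis=1, keepdims=True)  # numerical stability
        p = np.exp(logits)
        p /= p.sum(axis=1, keepdims=True)

        pred = p.argmax(axis=1)
        pred_conf = p.max(axis=1)

        # Pseudo-labels are re-evaluated every round: a sample keeps a label
        # only while its confidence stays at or above the threshold.
        accept = ~seed & (pred_conf >= threshold)
        new_labels = np.where(seed, labels, np.where(accept, pred, -1))
        new_conf = np.where(seed, 1.0, np.where(accept, pred_conf, 0.0))

        converged = np.array_equal(new_labels, labels)
        labels, conf = new_labels, new_conf
        if converged:
            break

    return labels, conf


# Toy data: two well-separated groups, one labeled sample in each.
X = np.array([[0.0, 0.0], [0.2, 0.1], [0.1, -0.2], [-0.1, 0.1],
              [5.0, 5.0], [5.2, 4.9], [4.9, 5.1], [5.1, 5.2]])
y = [0, -1, -1, -1, 1, -1, -1, -1]
labels, conf = self_train_with_confidence(X, y, n_classes=2)
print(labels, conf.round(3))
```

Because pseudo-labels are recomputed each round rather than fixed once assigned, an early mislabel can be revised later, which is the kind of error accumulation the abstract aims to limit. The abstract's own mechanism for adjusting confidence is the gravitational search algorithm, which this sketch does not attempt to reproduce.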