Keywords: DFA; EHO; Firefly GMM; Flower pollination optimization with GMM; GMM; LASSO; Lung cancer; Microarray gene expression; NBC; PSO GMM; STFT; SVM

MeSH: Humans; Colonic Neoplasms / genetics; Machine Learning; Gene Expression Profiling / methods; Support Vector Machine; Algorithms; Oligonucleotide Array Sequence Analysis / methods; Bayes Theorem; Gene Expression Regulation, Neoplastic; Lung Neoplasms / genetics classification; Fourier Analysis

Source: DOI:10.1038/s41598-024-67135-1

Abstract:
Microarray gene expression data pose a considerable challenge because of the curse of dimensionality: the number of features far exceeds the number of available samples, which leads to overfitting and reduced classification accuracy. The dimensionality of such data must therefore be reduced with efficient feature extraction methods that shrink the data volume and retain meaningful information, improving classification accuracy and interpretability. In this research, we apply STFT (Short-Time Fourier Transform), LASSO (Least Absolute Shrinkage and Selection Operator), and EHO (Elephant Herding Optimization) to extract significant features from lung cancer data and to reduce the dimensionality of the microarray gene expression database. Lung cancer classification is performed with the following classifiers: Gaussian Mixture Model (GMM), Particle Swarm Optimization (PSO) with GMM, Detrended Fluctuation Analysis (DFA), Naive Bayes classifier (NBC), Firefly with GMM, Support Vector Machine with Radial Basis Function kernel (SVM-RBF), and Flower Pollination Optimization (FPO) with GMM. EHO feature extraction combined with the FPO-GMM classifier attained the highest accuracy, 96.77%, with an F1 score of 97.5%, an MCC of 0.92, and a Kappa of 0.92. The reported results underline the value of STFT, LASSO, and EHO for feature extraction in reducing the dimensionality of microarray gene expression data. These methods also help in improved and early diagnosis of lung cancer, with enhanced classification accuracy and interpretability.
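The four reported measures (accuracy, F1 score, MCC, and Kappa) can all be derived from a binary confusion matrix. The following is a minimal sketch of those standard formulas; the function name `binary_metrics` and the example counts are hypothetical and are not taken from the paper.

```python
import math


def binary_metrics(tp, fp, fn, tn):
    """Accuracy, F1, MCC and Cohen's kappa from 2x2 confusion-matrix counts."""
    n = tp + fp + fn + tn
    accuracy = (tp + tn) / n

    # F1 is the harmonic mean of precision and recall.
    f1_den = 2 * tp + fp + fn
    f1 = 2 * tp / f1_den if f1_den else 0.0

    # Matthews correlation coefficient; defined as 0 when a margin is empty.
    mcc_den = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / mcc_den if mcc_den else 0.0

    # Cohen's kappa: observed agreement corrected for chance agreement p_e.
    p_e = ((tp + fp) * (tp + fn) + (tn + fn) * (tn + fp)) / (n * n)
    kappa = (accuracy - p_e) / (1 - p_e) if p_e != 1 else 0.0

    return {"accuracy": accuracy, "f1": f1, "mcc": mcc, "kappa": kappa}


# Hypothetical counts for illustration only (not the paper's data).
example = binary_metrics(tp=45, fp=5, fn=5, tn=45)
print(example)
```

With these illustrative counts the accuracy and F1 score are both 0.9, while MCC and Kappa are both 0.8, which shows how the chance-corrected measures can sit below raw accuracy on the same predictions.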