Keywords: CNN; LightGBM; SHAP (SHapley Additive exPlanations); limma; maize; tissue-specific genes

Source: DOI:10.3389/fgene.2023.1190887

Abstract:
Introduction: With advances in RNA-seq technology and machine learning, training machine learning models on large-scale RNA-seq data from public databases can identify genes with important regulatory roles that standard linear analytic methods previously missed. Finding tissue-specific genes could improve our understanding of the relationship between tissues and genes. However, few machine learning models for transcriptome data have been deployed and compared for identifying tissue-specific genes, particularly in plants.

Methods: In this study, an expression matrix built from 1,548 maize multi-tissue RNA-seq datasets obtained from a public database was processed with a linear model (Limma), a machine learning model (LightGBM), and a deep learning model (CNN), using information gain and the SHAP strategy, to identify tissue-specific genes. For validation, V-measure values were computed from k-means clustering of the gene sets to evaluate their technical complementarity. In addition, GO analysis and literature retrieval were used to validate the functions and research status of these genes.

Results: Based on clustering validation, the convolutional neural network outperformed the other methods, with a higher V-measure value of 0.647, indicating that its gene set covered as many tissue-specific properties as possible, whereas LightGBM discovered key transcription factors. Combining the three gene sets produced 78 core tissue-specific genes that had previously been shown in the literature to be biologically significant.

Discussion: Different tissue-specific gene sets were identified because the interpretation strategies for the machine learning models differ, so researchers may combine multiple methodologies and strategies to derive tissue-specific gene sets according to their goals, data types, and computational resources. This study provides comparative insight for large-scale data mining of transcriptome datasets and sheds light on addressing high dimensionality and bias in bioinformatics data processing.
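The clustering validation mentioned in the Methods can be illustrated with a minimal sketch of the V-measure itself, the harmonic mean of homogeneity and completeness. The abstract does not say how the score was computed, so the details here are assumptions: the reference classes are taken to be tissue labels, the cluster labels are taken to come from k-means run on the expression values of one gene set, and the function names (`entropy`, `conditional_entropy`, `v_measure`) are hypothetical. In practice a library routine such as scikit-learn's `v_measure_score` would normally be used instead.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in nats) of a sequence of labels."""
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())


def conditional_entropy(target, given):
    """H(target | given) for two equal-length label sequences."""
    n = len(target)
    joint = Counter(zip(given, target))
    given_counts = Counter(given)
    return -sum(
        (c / n) * math.log(c / given_counts[g]) for (g, _), c in joint.items()
    )


def v_measure(true_labels, cluster_labels):
    """V-measure with beta = 1: harmonic mean of homogeneity and completeness."""
    if len(true_labels) != len(cluster_labels) or not true_labels:
        raise ValueError("label sequences must be non-empty and equal in length")
    h_true = entropy(true_labels)
    h_clust = entropy(cluster_labels)
    # Homogeneity: each cluster should contain samples of a single tissue.
    homogeneity = (
        1.0 if h_true == 0
        else 1.0 - conditional_entropy(true_labels, cluster_labels) / h_true
    )
    # Completeness: each tissue should fall into a single cluster.
    completeness = (
        1.0 if h_clust == 0
        else 1.0 - conditional_entropy(cluster_labels, true_labels) / h_clust
    )
    if homogeneity + completeness == 0:
        return 0.0
    return 2 * homogeneity * completeness / (homogeneity + completeness)


# Hypothetical toy data: four samples, two tissues, cluster ids from k-means.
tissues = ["root", "root", "leaf", "leaf"]
clusters = [1, 1, 0, 0]
score = v_measure(tissues, clusters)
```

Because the score depends only on how samples are grouped, not on the cluster ids themselves, a gene set whose clusters line up with the tissues scores close to 1, and one whose clusters are unrelated to the tissues scores close to 0.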