Keywords: Deep learning; Model interpretability; Promoter; Representation learning

MeSH: Promoter Regions, Genetic; Neural Networks, Computer; Humans; Animals; Computational Biology / methods; Sequence Analysis, DNA / methods; Mice; Software

Source: DOI:10.1016/j.compbiomed.2024.108974

Abstract:
Promoters are DNA sequences that bind RNA polymerase to initiate transcription and regulate this process through interactions with transcription factors. Accurate identification of promoters is crucial for understanding gene expression regulation mechanisms and for developing therapeutic approaches for various diseases. However, experimental techniques for promoter identification are often expensive, time-consuming, and inefficient, which motivates the development of accurate and efficient computational models for this task. Enhancing a model's ability to recognize promoters across multiple species while improving its interpretability remains a significant challenge. In this study, we introduce a novel interpretable model based on graph neural networks, named GraphPro, for multi-species promoter identification. First, we encode the sequences using k-tuple nucleotide frequency patterns, dinucleotide physicochemical properties, and dna2vec. Next, we construct two feature extraction modules, one based on convolutional neural networks and one on graph neural networks. These modules aim to extract specific motifs from the promoters, learn the dependencies among them, and capture the underlying structural features of the promoters, providing a more comprehensive representation. Finally, a fully connected neural network predicts whether the input sequence is a promoter. We conducted extensive experiments on promoter datasets from eight species, including Human, Mouse, and Escherichia coli. The experimental results show that the average Sn, Sp, Acc and MCC values of GraphPro are 0.9123, 0.9482, 0.8840 and 0.7984, respectively. Compared with previous promoter identification methods, GraphPro not only achieves better recognition accuracy on multiple species but also outperforms all previous methods in cross-species prediction ability. Furthermore, by visualizing GraphPro's decision process and analyzing the sequences matching the transcription factor binding motifs captured by the model, we validate its significant advantages in biological interpretability. The source code for GraphPro is available at https://github.com/liuliwei1980/GraphPro.
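The abstract reports Sn, Sp, Acc and MCC without defining them. Assuming they carry their usual meanings for a binary promoter/non-promoter classifier (sensitivity, specificity, accuracy, and Matthews correlation coefficient), the minimal sketch below shows how they follow from confusion-matrix counts. The function name `binary_metrics` and the example counts are illustrative only; they are not taken from the GraphPro repository or from the paper's data.

```python
import math


def binary_metrics(tp, fp, tn, fn):
    """Compute Sn, Sp, Acc and MCC from confusion-matrix counts.

    tp/fn: promoters predicted as promoter / non-promoter.
    tn/fp: non-promoters predicted as non-promoter / promoter.
    Standard definitions are assumed; ratios with a zero denominator
    are reported as 0.0 rather than raising an error.
    """
    total = tp + fp + tn + fn
    sn = tp / (tp + fn) if (tp + fn) else 0.0    # sensitivity (recall)
    sp = tn / (tn + fp) if (tn + fp) else 0.0    # specificity
    acc = (tp + tn) / total if total else 0.0    # accuracy
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"Sn": sn, "Sp": sp, "Acc": acc, "MCC": mcc}


if __name__ == "__main__":
    # Hypothetical counts, for illustration only.
    for name, value in binary_metrics(tp=90, fp=5, tn=95, fn=10).items():
        print(f"{name}: {value:.4f}")
```

Because the reported figures are averages over eight species, they need not satisfy the per-dataset relationship between Acc, Sn and Sp that holds within a single confusion matrix.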