Keywords: disordered proteins; language model; molecular recognition features

MeSH: Algorithms; Deep Learning; Computational Biology / methods; Intrinsically Disordered Proteins / chemistry, metabolism; Databases, Protein / statistics & numerical data

Source: DOI:10.1142/S0219720024500069

Abstract:
Molecular recognition features (MoRFs) are functional segments of disordered proteins that play crucial roles in regulating the phase transition of membrane-less organelles and frequently serve as central sites in cellular interaction networks. As associations between disordered proteins and severe diseases continue to be discovered, identifying MoRFs has become increasingly important. Because few MoRFs have been experimentally validated, the performance of existing MoRF prediction algorithms remains limited. In this research, we present a model named MoRF_ESM, which uses deep-learning protein representations to predict MoRFs in disordered proteins. The approach employs a pretrained ESM-2 protein language model to generate embedding representations of residues in the form of attention map matrices. These representations are combined with a self-learned TextCNN model for feature extraction and prediction. In addition, an averaging step at the end of the MoRF_ESM model refines the output and generates the final prediction results. Compared with other leading methods on benchmark datasets, MoRF_ESM demonstrates state-of-the-art performance, achieving [Formula: see text] higher AUC than other methods on TEST1 and [Formula: see text] higher AUC than other methods on TEST2. These results imply that the combination of ESM-2 and TextCNN can effectively extract deep evolutionary features related to protein structure and function while also capturing shallow pattern features in protein sequences, and that it is well suited to MoRF prediction. Given that ESM-2 is a highly versatile protein language model, the methodology proposed in this study can be readily applied to other protein sequence classification tasks.
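The averaging step mentioned in the abstract is not specified here. As a rough illustration only, one common way to smooth per-residue scores is a centered sliding-window mean. The sketch below assumes that form; the function name `smooth_scores`, the `window` parameter, and the truncation at sequence ends are all hypothetical choices, not details taken from the paper.

```python
def smooth_scores(scores, window=5):
    """Average each per-residue score with its neighbours in a centred window.

    Assumption: `window` is a hypothetical odd window size. Near the ends of
    the sequence the window is truncated, so only existing residues are used.
    """
    if window < 1 or window % 2 == 0:
        raise ValueError("window must be a positive odd integer")
    half = window // 2
    smoothed = []
    for i in range(len(scores)):
        lo = max(0, i - half)                # clip window at the N-terminus
        hi = min(len(scores), i + half + 1)  # clip window at the C-terminus
        neighbourhood = scores[lo:hi]
        smoothed.append(sum(neighbourhood) / len(neighbourhood))
    return smoothed


# Toy per-residue propensities (made-up values, not model output).
raw = [0.0, 0.0, 1.0, 0.0, 0.0]
print(smooth_scores(raw, window=3))
```

A smoothing pass of this kind spreads an isolated high score over adjacent residues, which suits targets such as MoRFs that occur as contiguous segments rather than single positions.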