Keywords: Bioinformatics; Extreme gradient boosting; O-linked threonine glycosylation; Post-translational modification; Pretrained protein language model-based features; Two-step feature selection

MeSH: Glycosylation; Humans; Threonine / metabolism, chemistry; Protein Processing, Post-Translational; Software; Computational Biology / methods; Databases, Protein; Proteins / chemistry, metabolism

Source: DOI:10.1016/j.compbiomed.2024.108859

Abstract:
O-linked glycosylation is a complex post-translational modification (PTM) in human proteins that plays a critical role in regulating various cellular metabolic and signaling pathways. In contrast to N-linked glycosylation, O-linked glycosylation lacks specific sequence features and maintains an unstable core structure. Identifying O-linked threonine glycosylation sites (OTGs) remains challenging, requiring extensive experimental tests. While bioinformatics tools have emerged for predicting OTGs, their reliance on limited conventional features and absence of well-defined feature selection strategies limit their effectiveness. To address these limitations, we introduced HOTGpred (Human O-linked Threonine Glycosylation predictor), employing a multi-stage feature selection process to identify the optimal feature set for accurately identifying OTGs. Initially, we assessed 25 different feature sets derived from various pretrained protein language model (PLM)-based embeddings and conventional feature descriptors using nine classifiers. Subsequently, we integrated the top five embeddings linearly and determined the most effective scoring function for ranking hybrid features, identifying the optimal feature set through a process of sequential forward search. Among the classifiers, the extreme gradient boosting (XGBT)-based model, using the optimal feature set (HOTGpred), achieved 92.03% accuracy on the training dataset and 88.25% on the balanced independent dataset. Notably, HOTGpred significantly outperformed the current state-of-the-art methods on both the balanced and imbalanced independent datasets, demonstrating its superior prediction capabilities. Additionally, SHapley Additive exPlanations (SHAP) and ablation analyses were conducted to identify the features contributing most significantly to HOTGpred. Finally, we developed an easy-to-navigate web server, accessible at https://balalab-skku.org/HOTGpred/, to support glycobiologists in their research on glycosylation structure and function.
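The ranking-plus-sequential-forward-search step described in the abstract can be illustrated with a small sketch. This is not the HOTGpred implementation: the abstract does not name the scoring function or the search step size, so the Fisher-style score, the nearest-centroid classifier (a dependency-free stand-in for XGBoost), the synthetic data, and every function name below are assumptions chosen for illustration only.

```python
import random
from statistics import mean, pvariance


def make_toy_data(n_samples=200, n_features=6, seed=0):
    """Synthetic stand-in for site-level features; only features 0 and 1 carry signal."""
    rng = random.Random(seed)
    X, y = [], []
    for i in range(n_samples):
        label = i % 2  # alternate labels so classes are balanced
        row = [rng.gauss(0, 1) for _ in range(n_features)]
        row[0] += 2.0 * label
        row[1] += 1.5 * label
        X.append(row)
        y.append(label)
    return X, y


def fisher_score(column, labels):
    """Assumed ranking score: squared gap between class means over summed class variances."""
    pos = [v for v, lab in zip(column, labels) if lab == 1]
    neg = [v for v, lab in zip(column, labels) if lab == 0]
    return (mean(pos) - mean(neg)) ** 2 / (pvariance(pos) + pvariance(neg) + 1e-12)


def centroid_accuracy(X_train, y_train, X_test, y_test, features):
    """Accuracy of a nearest-centroid classifier restricted to the given feature indices."""
    centroids = {}
    for label in (0, 1):
        rows = [x for x, lab in zip(X_train, y_train) if lab == label]
        centroids[label] = [mean([r[f] for r in rows]) for f in features]
    correct = 0
    for x, lab in zip(X_test, y_test):
        dist = {
            c_label: sum((x[f] - c) ** 2 for f, c in zip(features, centroid))
            for c_label, centroid in centroids.items()
        }
        correct += int(min(dist, key=dist.get) == lab)
    return correct / len(y_test)


def cv_accuracy(X, y, features, k=5):
    """Mean accuracy over k interleaved folds (fold j holds samples j, j+k, j+2k, ...)."""
    scores = []
    for j in range(k):
        test_idx = list(range(j, len(X), k))
        held_out = set(test_idx)
        train_idx = [i for i in range(len(X)) if i not in held_out]
        scores.append(centroid_accuracy(
            [X[i] for i in train_idx], [y[i] for i in train_idx],
            [X[i] for i in test_idx], [y[i] for i in test_idx],
            features,
        ))
    return mean(scores)


def ranked_forward_search(X, y):
    """Rank all features once, then grow the subset in rank order and keep the best prefix.

    For brevity the ranking uses all samples; a stricter protocol would rank
    inside each training fold to avoid leaking held-out information.
    """
    n_features = len(X[0])
    ranking = sorted(
        range(n_features),
        key=lambda f: fisher_score([row[f] for row in X], y),
        reverse=True,
    )
    best_subset, best_acc = [], 0.0
    for size in range(1, n_features + 1):
        subset = ranking[:size]
        acc = cv_accuracy(X, y, subset)
        if acc > best_acc:  # strict improvement keeps the smallest best subset
            best_subset, best_acc = subset, acc
    return ranking, best_subset, best_acc


if __name__ == "__main__":
    X, y = make_toy_data()
    ranking, best_subset, best_acc = ranked_forward_search(X, y)
    print("feature ranking:", ranking)
    print("selected subset:", best_subset, "cv accuracy:", round(best_acc, 3))
```

In a setting closer to the one described, the rows would be concatenated PLM embeddings for candidate threonine sites and the evaluator would be a gradient-boosted model; only the outer loop (rank once, add features in order, keep the best-scoring subset) is the part this sketch is meant to convey.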