Keywords: ESM2 models; deep learning; ensemble method; fine-tuning; peptide toxicity; protein toxicity

MeSH: Proteins / metabolism, chemistry; Machine Learning; Databases, Protein; Computational Biology / methods; Humans; Peptides / toxicity, chemistry; Computer Simulation; Algorithms; Software

Source: DOI:10.1093/bib/bbae270

Abstract:
Peptide- and protein-based therapeutics are emerging as a promising treatment option for a wide range of diseases. Protein toxicity is the primary hurdle for such therapies. There is therefore an urgent need for accurate in silico methods that identify toxic proteins so that they can be filtered out of the pool of potential candidates. At the same time, non-toxic proteins must be identified precisely to expand the possibilities for protein-based biologics. To address this challenge, we propose an ensemble framework, VISH-Pred, comprising models built by fine-tuning ESM2 transformer models on a large, experimentally validated, curated dataset of protein and peptide toxicities. The framework estimates toxicity from the protein sequence alone. Its primary steps are applying an undersampling technique to handle the severe class imbalance in the data and learning representations from fine-tuned ESM2 protein language models, which are then fed to machine learning methods such as LightGBM and XGBoost. VISH-Pred correctly identifies both potentially toxic peptides/proteins and non-toxic proteins, achieving Matthews correlation coefficients of 0.737, 0.716 and 0.322 and F1-scores of 0.759, 0.696 and 0.713 on three non-redundant blind tests, respectively, and outperforming other methods by over 10% on these metrics. Moreover, VISH-Pred achieved the best accuracy and area under the receiver operating characteristic curve on these independent test sets, highlighting the robustness and generalization capability of the framework. By making VISH-Pred available as an easy-to-use web server, we expect it to serve as a valuable asset for future efforts to assess the toxicity of peptides and to enable efficient protein-based therapeutics.
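Two ingredients named in the abstract, undersampling and the Matthews correlation coefficient / F1-score, can be illustrated with a minimal sketch. This is not the VISH-Pred implementation. The abstract does not say which undersampling scheme was used, so random undersampling to a 1:1 class ratio is an assumption here, and the function names `undersample` and `mcc_and_f1` and the toy sequences are hypothetical.

```python
import math
import random


def undersample(sequences, labels, seed=0):
    """Randomly drop examples from the larger class until both classes
    have the same size (assumed 1:1 ratio; the paper's scheme may differ)."""
    rng = random.Random(seed)
    pos = [i for i, y in enumerate(labels) if y == 1]  # toxic
    neg = [i for i, y in enumerate(labels) if y == 0]  # non-toxic
    n = min(len(pos), len(neg))
    keep = sorted(rng.sample(pos, n) + rng.sample(neg, n))
    return [sequences[i] for i in keep], [labels[i] for i in keep]


def mcc_and_f1(y_true, y_pred):
    """Compute the Matthews correlation coefficient and F1-score
    for binary labels (1 = toxic, 0 = non-toxic)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    f1 = 2 * tp / (2 * tp + fp + fn) if (2 * tp + fp + fn) else 0.0
    return mcc, f1


if __name__ == "__main__":
    # Toy, made-up sequences: two toxic, four non-toxic.
    seqs = ["ACDK", "GHIK", "LMNP", "QRST", "VWYA", "CDEF"]
    labels = [1, 0, 0, 0, 0, 1]
    balanced_seqs, balanced_labels = undersample(seqs, labels)
    print(balanced_seqs, balanced_labels)
    print(mcc_and_f1([1, 1, 0, 0], [1, 0, 0, 0]))
```

MCC uses all four cells of the confusion matrix, which is why it is commonly reported alongside F1 when the classes are imbalanced, as they are in toxicity data.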