Keywords: enzyme prediction; evolutionary inference; interpretability analysis; large language model; protein motif detection; self-guided attentive learning

MeSH: Deep Learning; Enzymes / chemistry, metabolism; Computational Biology / methods; Software; Proteins / chemistry, metabolism; Databases, Protein; Algorithms

Source: DOI:10.1093/bib/bbae225   PDF (PubMed)

Abstract:
Accurate understanding of the biological functions of enzymes is vital for various tasks in both pathology and industrial biotechnology. However, existing methods are usually not fast enough and do not explain their predictions, which severely limits their real-world applications. Building on our previous work, DEEPre, we propose a new interpretable and fast version (ifDEEPre) that combines a novel self-guided attention mechanism with biological knowledge learned by large protein language models to accurately predict the Enzyme Commission (EC) numbers of enzymes and confirm their functions. The self-guided attention is designed to optimize the unique contributions of representations, automatically detecting key protein motifs to provide meaningful interpretations. Representations learned from raw protein sequences are strictly screened to improve the running speed of the framework, making it 50 times faster than DEEPre while requiring 12.89 times less storage space. Large protein language models are incorporated to learn physical properties from hundreds of millions of proteins, extending the biological knowledge of the whole network. Extensive experiments indicate that ifDEEPre outperforms all current methods, achieving an F1-score more than 14.22% higher on the NEW dataset. Furthermore, the trained ifDEEPre models accurately capture multi-level protein biological patterns and infer evolutionary trends of enzymes from raw sequences alone, without label information. ifDEEPre also predicts the evolutionary relationships between different yeast sub-species, and its predictions are highly consistent with the ground truth. Case studies indicate that ifDEEPre can detect key amino acid motifs, which has important implications for designing novel enzymes. A web server running ifDEEPre is available at https://proj.cse.cuhk.edu.hk/aihlab/ifdeepre/ to provide convenient services to the public. ifDEEPre is also freely available on GitHub at https://github.com/ml4bio/ifDEEPre/.