Keywords: large language models; protein annotations; protein families; transfer learning

MeSH: Proteins / chemistry; Databases, Protein; Molecular Sequence Annotation / methods; Computational Biology / methods; Machine Learning

Source: DOI: 10.1093/bib/bbae177

Abstract:
In UniProtKB, to date, more than 251 million proteins have been deposited. However, only 0.25% have been annotated with one of the more than 15,000 possible Pfam family domains. The current annotation protocol integrates knowledge from manually curated family domains, obtained using sequence alignments and hidden Markov models. This approach has been successful in automatically growing the Pfam annotations, but at a low rate compared to protein discovery. Just a few years ago, deep learning models were proposed for automatic Pfam annotation. However, these models demand a considerable amount of training data, which can be a challenge for poorly populated families. To address this issue, we propose and evaluate here a novel protocol based on transfer learning. This requires the use of protein large language models (LLMs), trained with self-supervision on big unannotated datasets in order to obtain sequence embeddings. Then, the embeddings can be used with supervised learning on a small, annotated dataset for a specialized task. In this protocol we have evaluated several cutting-edge protein LLMs together with machine learning architectures to improve the prediction of protein domain annotations. Results are significantly better than the state of the art for protein family classification, reducing the prediction error by an impressive 60% compared to standard methods. We explain how LLM embeddings can be used for protein annotation in a concrete and easy way, and provide the pipeline in a GitHub repo. Full source code and data are available at https://github.com/sinc-lab/llm4pfam.
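A minimal sketch of the two-stage transfer-learning recipe the abstract describes: embed each sequence with a pretrained protein LLM, then fit a lightweight supervised classifier on those embeddings. This is not the authors' pipeline (that lives in the linked repo); the choice of ESM-2 via the fair-esm package, mean pooling over residues, the logistic-regression head, and the toy sequences and Pfam labels are all illustrative assumptions.

# Sketch only: pip install fair-esm scikit-learn; model/head choices are assumptions.
import torch
import esm  # fair-esm package
from sklearn.linear_model import LogisticRegression

# Stage 1: a pretrained protein LLM (small ESM-2 variant, chosen here for speed).
model, alphabet = esm.pretrained.esm2_t6_8M_UR50D()
batch_converter = alphabet.get_batch_converter()
model.eval()

def embed(sequences):
    """Mean-pool per-residue representations into one fixed-size vector per protein."""
    data = [(f"seq{i}", s) for i, s in enumerate(sequences)]
    _, _, tokens = batch_converter(data)
    with torch.no_grad():
        out = model(tokens, repr_layers=[6])  # final layer of the 6-layer model
    reps = out["representations"][6]
    # Average over residue positions, skipping the BOS token at position 0.
    return [reps[i, 1 : len(seq) + 1].mean(0).numpy() for i, (_, seq) in enumerate(data)]

# Stage 2: supervised fit on a small annotated set.
# Toy (sequence, Pfam family) pairs -- placeholder labels, not real annotations.
train_seqs = ["MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ",
              "MQIFVKTLTGKTITLEVEPSDTIENVKAKIQDK"]
train_labels = ["PF00001", "PF00240"]

clf = LogisticRegression(max_iter=1000)
clf.fit(embed(train_seqs), train_labels)
print(clf.predict(embed(["MQIFVKTLTGKTITLEVE"])))

The point of the sketch is only the split the abstract argues for: the expensive self-supervised training happens once, upstream, on unannotated data, so the downstream Pfam classifier can be fitted cheaply even for poorly populated families.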