METHODS: We investigate how well GPT and LLaMA embed SMILES strings for downstream tasks, comparing their embeddings against those of models pre-trained on SMILES, with a focus on two key applications: molecular property prediction and drug-drug interaction (DDI) prediction.
RESULTS: We find that SMILES embeddings generated with LLaMA outperform those from GPT on both molecular property and DDI prediction tasks. Notably, LLaMA-based SMILES embeddings achieve results comparable to models pre-trained on SMILES for molecular property prediction and outperform those pre-trained models on DDI prediction.
CONCLUSIONS: The performance of LLMs in generating SMILES embeddings shows great promise and warrants further investigation of these models for molecular embedding. We hope our study bridges the gap between LLMs and molecular embedding and motivates additional research into the potential of LLMs for molecular representation. GitHub: https://github.com/sshaghayeghs/LLaMA-VS-GPT
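The pipeline the abstract describes, embedding each SMILES string with a frozen LLM and then training a lightweight predictor on those fixed vectors, can be sketched as follows. This is a minimal illustrative sketch, not the authors' code: mean-pooling over token hidden states is an assumed (but common) choice for collapsing an LLM's per-token outputs into one embedding per SMILES string.

```python
import numpy as np

def mean_pool(hidden_states, attention_mask):
    """Collapse per-token LLM hidden states into one fixed-size vector
    per SMILES string, ignoring padding tokens. The pooling strategy is
    an assumption here; the paper does not specify it in the abstract."""
    mask = attention_mask[..., None].astype(hidden_states.dtype)  # (B, T, 1)
    summed = (hidden_states * mask).sum(axis=1)                   # (B, D)
    counts = mask.sum(axis=1).clip(min=1.0)                       # (B, 1)
    return summed / counts

# Toy stand-in: batch of 2 SMILES strings, 3 tokens each, hidden size 4.
# In practice, hidden_states would come from a forward pass of LLaMA or
# GPT over tokenized SMILES (e.g. the last layer's hidden states).
hs = np.arange(24, dtype=np.float64).reshape(2, 3, 4)
attn = np.array([[1, 1, 0],   # first sequence padded after 2 tokens
                 [1, 1, 1]])
emb = mean_pool(hs, attn)
print(emb.shape)  # (2, 4): one embedding per molecule
```

The resulting `emb` matrix would then serve as input features for a downstream molecular property or DDI classifier.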