METHODS: We investigate how well GPT and LLaMA embed SMILES strings for downstream tasks, comparing their embeddings against those of models pre-trained on SMILES, with a focus on two key applications: molecular property prediction and drug-drug interaction (DDI) prediction.
RESULTS: We find that SMILES embeddings generated with LLaMA outperform those from GPT on both molecular property and DDI prediction tasks. Notably, LLaMA-based SMILES embeddings achieve results comparable to models pre-trained on SMILES for molecular property prediction and outperform those pre-trained models on DDI prediction.
CONCLUSIONS: The performance of LLMs in generating SMILES embeddings shows great promise and warrants further investigation of these models for molecular embedding. We hope our study bridges the gap between LLMs and molecular embedding and motivates additional research into the potential of LLMs for molecular representation. GitHub: https://github.com/sshaghayeghs/LLaMA-VS-GPT
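The pipeline the abstract describes, embedding each SMILES string with a frozen LLM and then training a lightweight predictor on those fixed vectors, can be sketched as follows. This is a minimal illustrative sketch, not the authors' code: mean-pooling over token hidden states is an assumed (but common) choice for collapsing an LLM's per-token outputs into one embedding per SMILES string.

```python
import numpy as np

def mean_pool(hidden_states, attention_mask):
    """Collapse per-token LLM hidden states into one fixed-size vector
    per SMILES string, ignoring padding tokens. The pooling strategy is
    an assumption here; the paper does not specify it in the abstract."""
    mask = attention_mask[..., None].astype(hidden_states.dtype)  # (B, T, 1)
    summed = (hidden_states * mask).sum(axis=1)                   # (B, D)
    counts = mask.sum(axis=1).clip(min=1.0)                       # (B, 1)
    return summed / counts

# Toy stand-in: batch of 2 SMILES strings, 3 tokens each, hidden size 4.
# In practice, hidden_states would come from a forward pass of LLaMA or
# GPT over tokenized SMILES (e.g. the last layer's hidden states).
hs = np.arange(24, dtype=np.float64).reshape(2, 3, 4)
attn = np.array([[1, 1, 0],   # first sequence padded after 2 tokens
                 [1, 1, 1]])
emb = mean_pool(hs, attn)
print(emb.shape)  # (2, 4): one embedding per molecule
```

The resulting `emb` matrix would then serve as input features for a downstream molecular property or DDI classifier.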