Keywords: ChatGPT; Claude; LLM evaluation; LLaMA; Large language models; Natural language processing; PaLM; Transformer

MeSH: Female; Humans; Benchmarking; Language; Uterus

Source: DOI:10.1016/j.compbiomed.2024.108189

Abstract:
Recently, Large Language Models (LLMs) have demonstrated an impressive capability to solve a wide range of tasks. Despite this success, no prior work has investigated their capability in the biomedical domain. To this end, this paper evaluates the performance of LLMs on benchmark biomedical tasks through a comprehensive evaluation of 4 popular LLMs on 6 diverse biomedical tasks across 26 datasets. To the best of our knowledge, this is the first work to conduct an extensive evaluation and comparison of various LLMs in the biomedical domain. Interestingly, we find that on biomedical datasets with smaller training sets, zero-shot LLMs even outperform the current state-of-the-art models when those models are fine-tuned only on the training sets of these datasets. This suggests that pre-training on large text corpora makes LLMs quite specialized even in the biomedical domain. We also find that no single LLM outperforms the others on all tasks; the performance of different LLMs varies depending on the task. While their performance is still quite poor in comparison to biomedical models fine-tuned on large training sets, our findings demonstrate that LLMs have the potential to be a valuable tool for various biomedical tasks that lack large annotated datasets.