Keywords: BARD, GPT, Gemini, LLM, LLMs, LLaMA-2, NLP, clinical data, clinical information, ensemble, evaluation, extraction, few shot, heuristic, in-context learning, language, language model, large language model, machine learning, models, natural language processing, prompt, prompt engineering, prompting, prompts, zero-shot

Source: DOI: 10.2196/55318 | PDF (PubMed)

Abstract:
BACKGROUND: Large language models (LLMs) have shown remarkable capabilities in natural language processing (NLP), especially in domains where labeled data are scarce or expensive, such as the clinical domain. However, to unlock the clinical knowledge hidden in these LLMs, we need to design effective prompts that can guide them to perform specific clinical NLP tasks without any task-specific training data. This is known as in-context learning, which is an art and science that requires understanding the strengths and weaknesses of different LLMs and prompt engineering approaches.
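To make the idea of in-context learning concrete, the following is a minimal sketch contrasting a zero-shot prompt with a few-shot prompt for clinical sense disambiguation. The abbreviation, candidate senses, and example notes are hypothetical placeholders, not material from the study.

```python
# Minimal sketch: zero-shot vs. few-shot (in-context) prompts for clinical
# sense disambiguation. The abbreviation 'RA', its candidate senses, and the
# example notes are hypothetical, not taken from the study's data.

ZERO_SHOT = (
    "In the clinical note below, the abbreviation 'RA' may mean "
    "'rheumatoid arthritis' or 'right atrium'. Answer with the intended sense.\n"
    "Note: {note}\n"
    "Answer:"
)

FEW_SHOT = (
    "Disambiguate the abbreviation 'RA' in each clinical note.\n"
    "Note: Echo shows a dilated RA with normal LV function.\n"
    "Answer: right atrium\n"
    "Note: Patient with RA on methotrexate, joints tender.\n"
    "Answer: rheumatoid arthritis\n"
    "Note: {note}\n"
    "Answer:"
)

def build_prompt(note: str, few_shot: bool = False) -> str:
    """Fill the chosen template; no task-specific training is involved."""
    template = FEW_SHOT if few_shot else ZERO_SHOT
    return template.format(note=note)

if __name__ == "__main__":
    note = "CT shows thrombus extending into the RA."
    print(build_prompt(note, few_shot=True))
```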
OBJECTIVE: The objective of this study is to assess the effectiveness of various prompt engineering techniques, including 2 newly introduced types (heuristic and ensemble prompts), for zero-shot and few-shot clinical information extraction using pretrained language models.
METHODS: This comprehensive experimental study evaluated different prompt types (simple prefix, simple cloze, chain of thought, anticipatory, heuristic, and ensemble) across 5 clinical NLP tasks: clinical sense disambiguation, biomedical evidence extraction, coreference resolution, medication status extraction, and medication attribute extraction. The performance of these prompts was assessed using 3 state-of-the-art language models: GPT-3.5 (OpenAI), Gemini (Google), and LLaMA-2 (Meta). The study contrasted zero-shot with few-shot prompting and explored the effectiveness of ensemble approaches.
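As a rough illustration of how these prompt families differ in form, the templates below sketch a simple prefix, simple cloze, chain-of-thought, and heuristic prompt for medication status extraction. The wording and the note snippet are our own hypothetical examples under stated assumptions, not the prompts evaluated in the study.

```python
# Hypothetical templates illustrating four of the evaluated prompt families
# for medication status extraction (active vs. discontinued). The exact
# wording is illustrative only; the study's prompts are not reproduced here.

SNIPPET = "Patient stopped lisinopril last week; continues metformin 500 mg."

PROMPTS = {
    # Simple prefix: task instruction placed before the input text.
    "simple_prefix": (
        "Extract each medication and its status (active or discontinued) "
        f"from the following note:\n{SNIPPET}\nMedications and statuses:"
    ),
    # Simple cloze: the model fills in blanks embedded in a statement.
    "simple_cloze": (
        f"Note: {SNIPPET}\n"
        "In this note, the status of lisinopril is ____ and the status of "
        "metformin is ____."
    ),
    # Chain of thought: ask for step-by-step reasoning before the answer.
    "chain_of_thought": (
        f"Note: {SNIPPET}\n"
        "List every medication mentioned, reason step by step about whether "
        "each one is currently being taken, then give the final statuses."
    ),
    # Heuristic: encode a domain rule of thumb directly in the instruction.
    "heuristic": (
        f"Note: {SNIPPET}\n"
        "Rule of thumb: verbs like 'stopped', 'discontinued', or 'held' mark a "
        "medication as discontinued; 'continues' or a current dose marks it as "
        "active. Apply this rule and list each medication with its status."
    ),
}

if __name__ == "__main__":
    for name, prompt in PROMPTS.items():
        print(f"--- {name} ---\n{prompt}\n")
```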
RESULTS: The study revealed that task-specific prompt tailoring is vital for high LLM performance in zero-shot clinical NLP. GPT-3.5 achieved an accuracy of 0.96 with heuristic prompts in clinical sense disambiguation and 0.94 in biomedical evidence extraction. Heuristic prompts, alongside chain of thought prompts, were highly effective across tasks. Few-shot prompting improved performance in complex scenarios, and ensemble approaches capitalized on the strengths of multiple prompts. GPT-3.5 consistently outperformed Gemini and LLaMA-2 across tasks and prompt types.
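One simple way to realize the ensemble idea mentioned above is a majority vote over the answers produced by several prompt variants. The sketch below assumes a generic `query_llm` callable (any LLM client wrapper would do); both that callable and the toy model used in the example are assumptions for illustration, not the aggregation procedure reported in the paper.

```python
# Sketch of an ensemble over prompt variants via majority vote. `query_llm`
# is a stand-in for any LLM client (e.g., a wrapper around an API call);
# it is assumed here for illustration, not a real library function.

from collections import Counter
from typing import Callable, Iterable

def ensemble_answer(
    prompts: Iterable[str],
    query_llm: Callable[[str], str],
) -> str:
    """Send each prompt variant to the model and return the majority answer."""
    answers = [query_llm(p).strip().lower() for p in prompts]
    most_common, _count = Counter(answers).most_common(1)[0]
    return most_common

if __name__ == "__main__":
    # Toy stand-in model so the sketch runs without any API access.
    fake_model = lambda prompt: "discontinued" if "stopped" in prompt else "active"
    variants = [
        "Note: patient stopped lisinopril. Status of lisinopril?",
        "The patient stopped lisinopril; is it active or discontinued?",
        "Status of lisinopril (patient reports no longer taking it)?",
    ]
    print(ensemble_answer(variants, fake_model))  # majority vote -> "discontinued"
```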
CONCLUSIONS: This study provides a rigorous evaluation of prompt engineering methodologies and introduces innovative techniques for clinical information extraction, demonstrating the potential of in-context learning in the clinical domain. These findings offer clear guidelines for future prompt-based clinical NLP research, facilitating engagement by non-NLP experts in clinical NLP advancements. To the best of our knowledge, this is one of the first works on the empirical evaluation of different prompt engineering approaches for clinical NLP in this era of generative artificial intelligence, and we hope that it will inspire and inform future research in this area.