Keywords: clinical knowledge; large language models; medical examination; natural language processing

Source: DOI:10.1093/jamia/ocae079

Abstract:
OBJECTIVE: Large Language Models (LLMs) such as ChatGPT and Med-PaLM have excelled in various medical question-answering tasks. However, these English-centric models encounter challenges in non-English clinical settings, primarily due to limited clinical knowledge in respective languages, a consequence of imbalanced training corpora. We systematically evaluate LLMs in the Chinese medical context and develop a novel in-context learning framework to enhance their performance.
METHODS: The latest China National Medical Licensing Examination (CNMLE-2022) served as the benchmark. We collected 53 medical books and 381,149 medical questions to construct the medical knowledge base and question bank. The proposed Knowledge and Few-shot Enhancement In-context Learning (KFE) framework leverages the in-context learning ability of LLMs to integrate diverse external clinical knowledge sources. We evaluated KFE with ChatGPT (GPT-3.5), GPT-4, Baichuan2-7B, Baichuan2-13B, and QWEN-72B on CNMLE-2022 and further investigated the effectiveness of different pathways for incorporating LLMs with medical knowledge from 7 distinct perspectives.
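The KFE idea described above — enriching the prompt with retrieved knowledge and similar solved questions before asking the model — can be illustrated with a minimal sketch. This is a hypothetical reconstruction, not the authors' code: the function name `build_kfe_prompt` and the prompt layout are illustrative assumptions about how retrieved knowledge-base passages and question-bank examples might be concatenated into a single in-context prompt.

```python
def build_kfe_prompt(question: str,
                     knowledge_snippets: list[str],
                     few_shot_examples: list[tuple[str, str]]) -> str:
    """Assemble an in-context prompt from external knowledge and solved examples.

    Hypothetical sketch of the KFE-style enhancement: passages retrieved from
    a medical knowledge base come first, then similar question-answer pairs
    from the question bank, then the target exam question.
    """
    lines: list[str] = []
    # Knowledge enhancement: prepend retrieved knowledge-base passages.
    for snippet in knowledge_snippets:
        lines.append(f"[Knowledge] {snippet}")
    # Few-shot enhancement: include similar questions with their answers.
    for q, a in few_shot_examples:
        lines.append(f"Question: {q}\nAnswer: {a}")
    # Finally, the target question, left open for the LLM to complete.
    lines.append(f"Question: {question}\nAnswer:")
    return "\n\n".join(lines)
```

The assembled string would then be sent to any of the evaluated LLMs; the retrieval step that selects `knowledge_snippets` and `few_shot_examples` is out of scope here.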
RESULTS: Directly applying ChatGPT failed to qualify for the CNMLE-2022, with a score of 51. When combined with the KFE framework, LLMs of varying sizes yielded consistent and significant improvements: ChatGPT's performance surged to 70.04, and GPT-4 achieved the highest score of 82.59. This surpasses the qualification threshold (60) and exceeds the average human score of 68.70, affirming the effectiveness and robustness of the framework. It also enabled the smaller Baichuan2-13B to pass the examination, showcasing great potential in low-resource settings.
CONCLUSIONS: This study sheds light on optimal practices for enhancing the capabilities of LLMs in non-English medical scenarios. By synergizing medical knowledge through in-context learning, LLMs can extend clinical insight beyond language barriers in healthcare, significantly reducing language-related disparities in LLM applications and ensuring global benefit in this field.