Keywords: AI assistance; Bard; ChatGPT 3.5; GPT-4; MedAlpaca; artificial intelligence; clinical decision support; complex diagnosis; complex diseases; consistency; language model; medical education; medical training; natural language processing; prediction model; prompt engineering; rare diseases; reliability

MeSH: Humans; Reproducibility of Results; Learning; Educational Status; Problem Solving; Language

Source: DOI: 10.2196/51391 | PDF (PubMed)

Abstract:
BACKGROUND: Patients with rare and complex diseases often experience delayed diagnoses and misdiagnoses because comprehensive knowledge about these diseases is limited to only a few medical experts. In this context, large language models (LLMs) have emerged as powerful knowledge aggregation tools with applications in clinical decision support and education domains.
OBJECTIVE: This study aims to explore the potential of 3 popular LLMs, namely Bard (Google LLC), ChatGPT-3.5 (OpenAI), and GPT-4 (OpenAI), in medical education to enhance the diagnosis of rare and complex diseases while investigating the impact of prompt engineering on their performance.
METHODS: We conducted experiments on publicly available complex and rare cases to achieve these objectives. We implemented various prompt strategies to evaluate the performance of these models using both open-ended and multiple-choice prompts. In addition, we used a majority voting strategy to leverage diverse reasoning paths within language models, aiming to enhance their reliability. Furthermore, we compared their performance with the performance of human respondents and MedAlpaca, a generative LLM specifically designed for medical tasks.
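The majority-voting step described in METHODS corresponds to self-consistency sampling: the same case prompt is answered several times at a nonzero temperature, and the most frequent diagnosis is kept. The sketch below is a minimal illustration of that idea, not the authors' implementation; query_llm is a hypothetical stand-in for any chat-completion client, and answers are compared after simple string normalization.

from collections import Counter

def query_llm(prompt: str, temperature: float = 0.7) -> str:
    # Hypothetical placeholder for a real LLM API call
    # (Bard, ChatGPT-3.5, GPT-4, MedAlpaca, ...).
    raise NotImplementedError("replace with a real chat-completion client")

def majority_vote_diagnosis(case_prompt: str, n_samples: int = 5) -> str:
    # Sampling with nonzero temperature yields diverse reasoning paths;
    # aggregating them by simple plurality is intended to improve reliability.
    answers = [query_llm(case_prompt, temperature=0.7).strip().lower()
               for _ in range(n_samples)]
    winner, _count = Counter(answers).most_common(1)[0]
    return winner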
RESULTS: Notably, all LLMs outperformed the average human consensus and MedAlpaca, with minimum margins of 5% and 13%, respectively, across all 30 cases from the diagnostic case challenge collection. In the frequently misdiagnosed cases category, Bard tied with MedAlpaca but surpassed the human average consensus by 14%, whereas GPT-4 and ChatGPT-3.5 outperformed MedAlpaca and the human respondents in the moderately often misdiagnosed cases category with minimum accuracy scores of 28% and 11%, respectively. The majority voting strategy, particularly with GPT-4, demonstrated the highest overall score across all cases from the diagnostic complex case collection, surpassing that of the other LLMs. On the Medical Information Mart for Intensive Care-III (MIMIC-III) data sets, Bard and GPT-4 achieved the highest diagnostic accuracy scores, with multiple-choice prompts scoring 93%, whereas ChatGPT-3.5 and MedAlpaca scored 73% and 47%, respectively. Furthermore, our results demonstrate that there is no one-size-fits-all prompting approach for improving the performance of LLMs and that a single strategy does not universally apply to all LLMs.
CONCLUSIONS: Our findings shed light on the diagnostic capabilities of LLMs and the challenges associated with identifying an optimal prompting strategy that aligns with each language model's characteristics and specific task requirements. The significance of prompt engineering is highlighted, providing valuable insights for researchers and practitioners who use these language models for medical training. Furthermore, this study represents a crucial step toward understanding how LLMs can enhance diagnostic reasoning in rare and complex medical cases, paving the way for developing effective educational tools and accurate diagnostic aids to improve patient care and outcomes.