BACKGROUND: Large language models (LLMs) have demonstrated impressive performance across various medical domains, prompting an exploration of their potential utility within the high-demand setting of emergency department (ED) triage. This study evaluated the triage proficiency of different LLMs and ChatGPT, an LLM-based chatbot, compared to professionally trained ED staff and untrained personnel. We further explored whether LLM responses could guide untrained staff in effective triage.
OBJECTIVE: This study aimed to assess the efficacy of LLMs and the associated product ChatGPT in ED triage compared to personnel of varying training status and to investigate if the models' responses can enhance the triage proficiency of untrained personnel.
METHODS: A total of 124 anonymized case vignettes were triaged by untrained doctors; different versions of currently available LLMs; ChatGPT; and professionally trained raters, who subsequently agreed on a consensus set according to the Manchester Triage System (MTS). The prototypical vignettes were adapted from cases at a tertiary ED in Germany. The main outcome was the level of agreement between raters' MTS level assignments, measured via quadratic-weighted Cohen κ. The extent of over- and undertriage was also determined. Notably, instances of ChatGPT were prompted using zero-shot approaches without extensive background information on the MTS. The tested LLMs included raw GPT-4, Llama 3 70B, Gemini 1.5, and Mixtral 8x7B.
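To make the zero-shot setup concrete, the following is a minimal sketch of how such a prompt could be issued programmatically via the OpenAI Python client. The prompt wording, model settings, and case text here are illustrative assumptions, not the study's actual materials.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Hypothetical case vignette; the study's vignettes are not reproduced here
vignette = (
    "68-year-old male, sudden-onset chest pain radiating to the left arm, "
    "diaphoretic, blood pressure 95/60 mm Hg, heart rate 110/min."
)

# Zero-shot prompt: no worked examples and no MTS reference material supplied
response = client.chat.completions.create(
    model="gpt-4",
    messages=[{
        "role": "user",
        "content": (
            "Assign a Manchester Triage System (MTS) category (1=immediate "
            "to 5=non-urgent) to the following emergency department case:\n\n"
            f"{vignette}\n\nAnswer with the MTS category only."
        ),
    }],
)
print(response.choices[0].message.content)
```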
RESULTS: GPT-4-based ChatGPT and untrained doctors showed substantial agreement with the consensus triage of professional raters (κ=mean 0.67, SD 0.037 and κ=mean 0.68, SD 0.056, respectively), significantly exceeding the performance of GPT-3.5-based ChatGPT (κ=mean 0.54, SD 0.024; P<.001). When untrained doctors used this LLM for second-opinion triage, there was a slight but statistically nonsignificant performance increase (κ=mean 0.70, SD 0.047; P=.97). Other tested LLMs performed similarly to or worse than GPT-4-based ChatGPT or showed odd triaging behavior with the parameters used. LLMs and ChatGPT models tended toward overtriage, whereas untrained doctors undertriaged.
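For orientation, κ values and over-/undertriage figures of the kind reported above can in principle be computed as in the following minimal sketch, assuming MTS categories encoded 1 (most urgent) to 5 (least urgent). The ratings shown are hypothetical placeholders, not study data, and scikit-learn stands in for whatever statistics software the study used.

```python
from sklearn.metrics import cohen_kappa_score

# Hypothetical MTS assignments (1=most urgent, 5=least urgent) for 8 vignettes
consensus = [1, 3, 2, 4, 3, 5, 2, 4]  # professional raters' consensus set
rater     = [1, 2, 2, 4, 4, 5, 3, 3]  # e.g., one LLM's or doctor's assignments

# Quadratic weighting penalizes disagreements more the further the assigned
# category lies from the consensus category
kappa = cohen_kappa_score(consensus, rater, weights="quadratic")
print(f"Quadratic-weighted Cohen kappa: {kappa:.2f}")

# A numerically lower assigned category than the consensus means higher
# assigned acuity (overtriage); a numerically higher one means undertriage
overtriage = sum(r < c for r, c in zip(rater, consensus)) / len(consensus)
undertriage = sum(r > c for r, c in zip(rater, consensus)) / len(consensus)
print(f"Overtriage: {overtriage:.0%}, undertriage: {undertriage:.0%}")
```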
CONCLUSIONS: While LLMs and the LLM-based product ChatGPT do not yet match professionally trained raters, their best models' triage proficiency equals that of untrained ED doctors. In their current form, LLMs and ChatGPT thus did not demonstrate gold-standard performance in ED triage and, in the setting of this study, failed to significantly improve untrained doctors' triage when used as decision support. Notable performance enhancements in newer LLM versions over older ones hint at future improvements with further technological development and specific training.