Keywords: Artificial intelligence; Bilateral vocal fold paralysis; ChatGPT; Decision-making; Laryngology; Llama

Source: DOI:10.1016/j.jvoice.2024.02.020

Abstract:
OBJECTIVE: Artificial intelligence-powered language models, such as Chat Generative Pre-trained Transformer (ChatGPT) and Large Language Model Meta AI (Llama), are emerging in medicine. Patients and practitioners have full access to chatbots that may provide medical information. The aim of this study was to explore the performance and accuracy of ChatGPT and Llama in treatment decision-making for bilateral vocal fold paralysis (BVFP).
METHODS: Data from 20 clinical cases treated between 2018 and 2023 were retrospectively collected from four tertiary laryngology centers in Europe. The cases were defined as the most common or most challenging scenarios regarding BVFP treatment. The treatment proposals were discussed in the centers' local multidisciplinary teams (MDTs). Each case was presented to ChatGPT-4.0 and Llama Chat-2.0, and potential treatment strategies were requested. The Artificial Intelligence Performance Instrument (AIPI) treatment subscore was used to compare both chatbots' performance with the MDT treatment proposal.
RESULTS: The most common etiology of BVFP was thyroid surgery. A form of partial arytenoidectomy, with or without posterior transverse cordotomy, was the MDT proposal for most cases. The accuracy of both chatbots' treatment proposals was very low, with the maximum AIPI treatment score reached in 5% of the cases. In most cases, even harmful assertions were made, including the suggestion of vocal fold medialisation to treat patients with stridor and dyspnea. ChatGPT-4.0 performed significantly better than Llama Chat-2.0 in suggesting the correct treatment as part of the treatment proposal (50% vs. 15%).
CONCLUSIONS: ChatGPT and Llama were judged inaccurate in proposing the correct treatment for BVFP. ChatGPT significantly outperformed Llama. Treatment decision-making for a complex condition such as BVFP is clearly beyond the chatbots' expertise. This study highlights the complexity and heterogeneity of BVFP treatment and the need for further guidelines dedicated to the management of BVFP.