BACKGROUND: Artificial intelligence is increasingly being applied to many workflows. Large language models (LLMs) are publicly accessible platforms trained to understand, interact with, and produce human-readable text; their ability to deliver relevant and reliable information is of particular interest to health care providers and patients.
Hematopoietic stem cell transplantation (HSCT) is a complex medical field requiring extensive knowledge, background, and training to practice successfully, and it can be challenging for a nonspecialist audience to comprehend.
OBJECTIVE: We aimed to test the applicability of 3 prominent LLMs, namely ChatGPT-3.5 (OpenAI), ChatGPT-4 (OpenAI), and Bard (Google AI), in guiding nonspecialist health care professionals and advising patients seeking information regarding HSCT.
METHODS: We submitted 72 open-ended HSCT-related questions of variable difficulty to the LLMs and rated their responses based on consistency (defined as replicability of the response), response veracity, language comprehensibility, specificity to the topic, and the presence of hallucinations. We then rechallenged the 2 best-performing chatbots by resubmitting the most difficult questions and prompting them to respond as if communicating with either a health care professional or a patient and to provide verifiable sources of information. Responses were then rerated with the additional criterion of language appropriateness, defined as language adaptation for the intended audience.
RESULTS: ChatGPT-4 outperformed both ChatGPT-3.5 and Bard in terms of response consistency (66/72, 92%; 54/72, 75%; and 63/69, 91%, respectively; P=.007), response veracity (58/66, 88%; 40/54, 74%; and 16/63, 25%, respectively; P<.001), and specificity to the topic (60/66, 91%; 43/54, 80%; and 27/63, 43%, respectively; P<.001). Both ChatGPT-4 and ChatGPT-3.5 outperformed Bard in terms of language comprehensibility (64/66, 97%; 53/54, 98%; and 52/63, 83%, respectively; P=.002). All displayed episodes of hallucinations. ChatGPT-3.5 and ChatGPT-4 were then rechallenged with a prompt to adapt their language to the audience and to provide sources of information, and their responses were rerated. ChatGPT-3.5 showed a better ability than ChatGPT-4 to adapt its language to a nonmedical audience (17/21, 81% and 10/22, 46%, respectively; P=.03); however, both failed to consistently provide correct and up-to-date information resources, reporting out-of-date materials, incorrect URLs, or unfocused references, rendering their output unverifiable by the reader.
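As a rough sanity check on the reported consistency comparison, the three consistency proportions can be compared with a Pearson chi-square test on a 3×2 contingency table built from the counts above. The abstract does not state which test the authors used, so the P value obtained this way need not match the reported P=.007 exactly; the sketch below only illustrates the arithmetic.

```python
import math

# Contingency table (consistent, not consistent), counts from the abstract:
# ChatGPT-4 66/72, ChatGPT-3.5 54/72, Bard 63/69.
table = [(66, 72 - 66), (54, 72 - 54), (63, 69 - 63)]

def chi_square(rows):
    """Pearson chi-square statistic for an r x 2 contingency table."""
    grand = sum(a + b for a, b in rows)
    col_tot = [sum(r[j] for r in rows) for j in (0, 1)]
    stat = 0.0
    for a, b in rows:
        row_tot = a + b
        for j, obs in enumerate((a, b)):
            expected = row_tot * col_tot[j] / grand
            stat += (obs - expected) ** 2 / expected
    return stat

stat = chi_square(table)
# For df = 2, the chi-square survival function reduces to exp(-x/2).
p_value = math.exp(-stat / 2)
print(f"chi2 = {stat:.2f}, p ~ {p_value:.4f}")
```

The statistic comes out near 10.7 (df = 2), i.e. P < .01, consistent in magnitude with the reported P=.007; small differences would be expected if the authors applied a continuity correction or an exact test.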
CONCLUSIONS: Despite the potential capability of LLMs in confronting challenging medical topics such as HSCT, the presence of mistakes and the lack of clear references make them not yet appropriate for routine, unsupervised clinical use or patient counseling. Enabling LLMs to access and reference current, updated websites and research papers, as well as developing LLMs trained on specialized domain knowledge data sets, may offer potential solutions for their future clinical application.