Keywords: AI; ChatGPT; LLM; LLMs; ORL; artificial intelligence; chatbot; chatbots; digital health; global health; language model; large language models; low- and middle-income countries; otorhinolaryngology; telehealth; telemedicine

Source: DOI:10.2196/49183 | PDF (PubMed)

Abstract:
BACKGROUND: Large language models (LLMs), such as ChatGPT (OpenAI), are increasingly used in medicine and supplement standard search engines as information sources. This leads to more "consultations" of LLMs about personal medical symptoms.
OBJECTIVE: This study aims to evaluate ChatGPT's performance in answering clinical case-based questions in otorhinolaryngology (ORL) in comparison to ORL consultants' answers.
METHODS: We used 41 case-based questions from established ORL study books and past German state examinations for doctors. The questions were answered by both ORL consultants and ChatGPT 3. ORL consultants rated all responses, except their own, on medical adequacy, conciseness, coherence, and comprehensibility using a 6-point Likert scale. They also identified (in a blinded setting) if the answer was created by an ORL consultant or ChatGPT. Additionally, the character count was compared. Due to the rapidly evolving pace of technology, a comparison between responses generated by ChatGPT 3 and ChatGPT 4 was included to give an insight into the evolving potential of LLMs.
RESULTS: Ratings in all categories were significantly higher for ORL consultants (P<.001). Although inferior to the scores of the ORL consultants, ChatGPT's scores were relatively higher in semantic categories (conciseness, coherence, and comprehensibility) compared to medical adequacy. ORL consultants identified ChatGPT as the source correctly in 98.4% (121/123) of cases. ChatGPT's answers had a significantly higher character count compared to ORL consultants (P<.001). Comparison between responses generated by ChatGPT 3 and ChatGPT 4 showed a slight improvement in medical accuracy as well as a better coherence of the answers provided. Contrarily, neither the conciseness (P=.06) nor the comprehensibility (P=.08) improved significantly despite the significant increase in the mean number of characters by 52.5% (n=(1470-964)/964; P<.001).
CONCLUSIONS: While ChatGPT provided longer answers to medical problems, medical adequacy and conciseness were significantly lower compared to ORL consultants' answers. LLMs have potential as augmentative tools for medical care, but their "consultation" for medical problems carries a high risk of misinformation, as their high semantic quality may mask contextual deficits.
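For reference, the 52.5% figure in the results is simply the relative increase in mean answer length from ChatGPT 3 to ChatGPT 4, using the mean character counts reported in the abstract (964 and 1470 characters, respectively):

\[
\frac{1470 - 964}{964} = \frac{506}{964} \approx 0.525 = 52.5\%
\]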