Keywords: Artificial intelligence; ChatGPT; Claude 3 Opus; GPT-4o; Gemini 1.5 Pro; Large language model

Source: DOI: 10.1007/s11604-024-01619-y

Abstract:
OBJECTIVE: Large language models (LLMs) are rapidly advancing and demonstrating high performance in understanding textual information, suggesting potential applications in interpreting patient histories and documented imaging findings. As LLMs continue to improve, their diagnostic abilities are expected to be enhanced further. However, there is a lack of comprehensive comparisons between LLMs from different manufacturers. In this study, we aimed to test the diagnostic performance of the three latest major LLMs (GPT-4o, Claude 3 Opus, and Gemini 1.5 Pro) using Radiology Diagnosis Please Cases, a monthly diagnostic quiz series for radiology experts.
METHODS: Clinical history and imaging findings, provided textually by the case submitters, were extracted from 324 quiz questions originating from Radiology Diagnosis Please cases published between 1998 and 2023. The top three differential diagnoses were generated by GPT-4o, Claude 3 Opus, and Gemini 1.5 Pro, using their respective application programming interfaces. A comparative analysis of diagnostic performance among these three LLMs was conducted using Cochran's Q and post hoc McNemar's tests.
RESULTS: The respective diagnostic accuracies of GPT-4o, Claude 3 Opus, and Gemini 1.5 Pro for primary diagnosis were 41.0%, 54.0%, and 33.9%, which further improved to 49.4%, 62.0%, and 41.0%, when considering the accuracy of any of the top three differential diagnoses. Significant differences in the diagnostic performance were observed among all pairs of models.
CONCLUSIONS: Claude 3 Opus outperformed GPT-4o and Gemini 1.5 Pro in solving radiology quiz cases. These models appear capable of assisting radiologists when supplied with accurate evaluations and clearly worded descriptions of imaging findings.
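The statistical comparison described in the methods (Cochran's Q across the three models, followed by pairwise McNemar tests on the per-case correct/incorrect outcomes) can be sketched in pure Python. The 0/1 table below is a small hypothetical example, not the study's actual per-case data:

```python
# Cochran's Q and McNemar statistics for a cases-by-models 0/1 table.
# Rows are quiz cases; columns are models (e.g., GPT-4o, Claude 3 Opus,
# Gemini 1.5 Pro). The data here are illustrative only.

def cochrans_q(table):
    """Cochran's Q statistic; ~ chi-squared with k-1 degrees of freedom."""
    k = len(table[0])                                  # number of models
    col = [sum(row[j] for row in table) for j in range(k)]
    row_totals = [sum(row) for row in table]
    n = sum(col)                                       # grand total of 1s
    num = k * (k - 1) * sum((c - n / k) ** 2 for c in col)
    den = k * sum(row_totals) - sum(r * r for r in row_totals)
    return num / den

def mcnemar_stat(table, i, j):
    """McNemar chi-squared (no continuity correction) for models i and j,
    computed from the discordant pairs only."""
    b = sum(1 for row in table if row[i] == 1 and row[j] == 0)
    c = sum(1 for row in table if row[i] == 0 and row[j] == 1)
    return (b - c) ** 2 / (b + c)

# Hypothetical per-case correctness for three models on six cases.
data = [
    [1, 1, 0],
    [0, 1, 0],
    [1, 1, 1],
    [0, 1, 0],
    [1, 0, 0],
    [0, 1, 1],
]

print(cochrans_q(data))          # 2.8 on this toy table
print(mcnemar_stat(data, 0, 1))  # 1.0 on this toy table
```

Cochran's Q tests whether the three proportions differ overall; the pairwise McNemar tests then localize which model pairs differ, as reported in the results.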