METHODS: Clinical histories and imaging findings, described textually by the case submitters, were extracted from 324 quiz questions originating from Radiology Diagnosis Please cases published between 1998 and 2023. The top three differential diagnoses were generated by GPT-4o, Claude 3 Opus, and Gemini 1.5 Pro through their respective application programming interfaces. Diagnostic performance among the three LLMs was compared using Cochran's Q test with post hoc McNemar tests.
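The statistical comparison described above can be sketched with a minimal, self-contained implementation. This is an illustrative reconstruction, not the authors' code: Cochran's Q tests whether the per-case correctness rates of the k = 3 paired models differ overall, and the exact McNemar test compares one pair of models from its discordant counts. Input encodings (1 = model answered the case correctly) are assumed.

```python
import math

def cochrans_q(results):
    """Cochran's Q statistic for k related binary samples.
    `results` is a list of rows, one per quiz case, each row holding
    k binary outcomes (1 = that model gave the correct diagnosis).
    The statistic is chi-squared distributed with k - 1 degrees of
    freedom under the null of equal accuracy."""
    k = len(results[0])
    col = [sum(row[j] for row in results) for j in range(k)]  # per-model totals
    row_tot = [sum(row) for row in results]                   # per-case totals
    n = sum(col)
    denom = k * n - sum(r * r for r in row_tot)               # assumes some discordant cases
    return (k - 1) * (k * sum(c * c for c in col) - n * n) / denom

def mcnemar_exact(b, c):
    """Two-sided exact (binomial) McNemar test for one model pair,
    from the discordant counts b (only model A correct) and
    c (only model B correct)."""
    n, m = b + c, min(b, c)
    p = 2 * sum(math.comb(n, i) for i in range(m + 1)) / 2 ** n
    return min(p, 1.0)
```

For k = 3 models the reference distribution has 2 degrees of freedom, for which the chi-squared survival function reduces to `math.exp(-q / 2)`, so no external statistics library is needed for the omnibus p-value.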
RESULTS: The diagnostic accuracies of GPT-4o, Claude 3 Opus, and Gemini 1.5 Pro for the primary diagnosis were 41.0%, 54.0%, and 33.9%, respectively, improving to 49.4%, 62.0%, and 41.0% when any of the top three differential diagnoses was counted as correct. Diagnostic performance differed significantly between all pairs of models.
CONCLUSIONS: Claude 3 Opus outperformed GPT-4o and Gemini 1.5 Pro in solving radiology quiz cases. These models appear capable of assisting radiologists when supplied with accurate, well-worded textual descriptions of imaging findings.