Keywords: Case reports; ChatGPT 4; Diagnostic accuracy; USMLE

MeSH: Humans; United States; Artificial Intelligence; Diagnosis, Differential; Licensure, Medical; Clinical Competence; Educational Measurement / methods

Source: DOI: 10.1038/s41598-024-58760-x   PDF (PubMed)

Abstract:
While there are data assessing the test performance of artificial intelligence (AI) chatbots, including the Generative Pre-trained Transformer 4.0 (GPT 4) chatbot (ChatGPT 4.0), there are scarce data on its diagnostic accuracy for clinical cases. We assessed the large language model (LLM) ChatGPT 4.0 on its ability to answer questions from the United States Medical Licensing Exam (USMLE) Step 2, as well as its ability to generate a differential diagnosis from corresponding clinical vignettes drawn from published case reports. A total of 109 Step 2 Clinical Knowledge (CK) practice questions were inputted into both ChatGPT 3.5 and ChatGPT 4.0, asking ChatGPT to pick the correct answer. Compared with its previous version, ChatGPT 3.5, ChatGPT 4.0 showed improved accuracy when answering these questions, from 47.7% to 87.2% (p = 0.035). Using the topics tested in the Step 2 CK questions, we additionally identified 63 corresponding published case report vignettes and asked ChatGPT 4.0 to provide its top three differential diagnoses. ChatGPT 4.0 accurately created a shortlist of differential diagnoses in 47 of the 63 case reports (74.6%). We analyzed ChatGPT 4.0's confidence in its diagnoses by asking it to rank its top three differentials from most to least likely. Of the 47 correct diagnoses, 33 were first (70.2%) on the differential diagnosis list, 11 were second (23.4%), and three were third (6.4%). Our study shows the continued iterative improvement in ChatGPT's ability to answer standardized USMLE questions accurately and provides insights into ChatGPT's clinical diagnostic accuracy.
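The proportions reported in the abstract can be cross-checked with simple arithmetic. The short Python sketch below recomputes them from the counts stated above (63 case reports, 47 correct diagnoses, and the 33/11/3 rank breakdown); it is an illustrative consistency check, not part of the study's methods.

# Recompute the percentages reported in the abstract from the stated counts.
# All numbers below are taken directly from the abstract; nothing else is assumed.
case_reports = 63
correct_diagnoses = 47  # cases where the correct diagnosis appeared in the top three
rank_counts = {"first": 33, "second": 11, "third": 3}

print(f"Shortlist accuracy: {correct_diagnoses / case_reports:.1%}")  # ~74.6%
for rank, n in rank_counts.items():
    print(f"Correct diagnosis ranked {rank}: {n / correct_diagnoses:.1%}")
# Expected output: 74.6%, then 70.2%, 23.4%, 6.4%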