Keywords: ChatGPT; Clinical Decision-Making; Medical Education; OpenAI; USMLE; USMLE Step 1

MeSH: Humans; Educational Measurement/methods; Licensure, Medical; Surveys and Questionnaires

Source: DOI:10.1038/s41598-024-63997-7 | PDF (PubMed)

Abstract:
ChatGPT has garnered attention as a multifaceted AI chatbot with potential applications in medicine. Despite intriguing preliminary findings in areas such as clinical management and patient education, a substantial knowledge gap remains in comprehensively understanding the opportunities and limitations of ChatGPT's capabilities, especially in medical test-taking and education. A total of n = 2,729 USMLE Step 1 practice questions were extracted from the Amboss question bank. After excluding 352 image-based questions, the remaining 2,377 text-based questions were categorized and entered manually into ChatGPT, and its responses were recorded. ChatGPT's overall performance was analyzed by question difficulty, category, and content with regard to specific signal words and phrases. ChatGPT achieved an overall accuracy of 55.8% on the n = 2,377 USMLE Step 1 preparation questions obtained from the Amboss online question bank. It showed a significant inverse correlation between question difficulty and performance (rs = -0.306; p < 0.001), maintaining accuracy comparable to the human user peer group across different levels of question difficulty. Notably, ChatGPT outperformed the human peer group on serology-related questions (61.1% vs. 53.8%; p = 0.005) but struggled with ECG-related content (42.9% vs. 55.6%; p = 0.021). ChatGPT performed significantly worse on pathophysiology-related question stems (signal phrase: "what is the most likely/probable cause"). Otherwise, ChatGPT performed consistently across question categories and difficulty levels. These findings emphasize the need for further investigation of the potential and limitations of ChatGPT in medical examination and education.
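To make the reported statistics concrete, below is a minimal, illustrative Python sketch of the kind of analysis the abstract describes, not the authors' actual code. It assumes a hypothetical per-question results table; the column names (`difficulty`, `correct`, `category`) and the synthetic data are invented here, and since the abstract does not state which test was used for the category comparisons, a two-proportion z-test stands in as one plausible choice.

```python
# Illustrative sketch (not the paper's code): accuracy, a Spearman rank
# correlation between difficulty and correctness, and a category comparison.
import numpy as np
import pandas as pd
from scipy.stats import spearmanr
from statsmodels.stats.proportion import proportions_ztest

rng = np.random.default_rng(0)

# Hypothetical schema, one row per question:
#   difficulty: 1 (easiest) .. 5 (hardest), as in the Amboss hammer rating
#   correct:    1 if ChatGPT answered correctly, else 0
#   category:   question content label, e.g. "serology"
df = pd.DataFrame({
    "difficulty": rng.integers(1, 6, size=500),
    "correct": rng.binomial(1, 0.56, size=500),
    "category": rng.choice(["serology", "ecg", "other"], size=500),
})

# Overall accuracy (the paper reports 55.8% over n = 2,377 questions).
print(f"accuracy = {df['correct'].mean():.3f}")

# Spearman correlation of difficulty vs. correctness
# (the paper reports rs = -0.306, p < 0.001).
rs, p = spearmanr(df["difficulty"], df["correct"])
print(f"rs = {rs:.3f}, p = {p:.4g}")

# Two-proportion z-test: accuracy on one category vs. all other questions
# (analogous to the serology comparison, 61.1% vs. 53.8%; p = 0.005).
mask = df["category"] == "serology"
count = np.array([df.loc[mask, "correct"].sum(), df.loc[~mask, "correct"].sum()])
nobs = np.array([mask.sum(), (~mask).sum()])
z, p_prop = proportions_ztest(count, nobs)
print(f"z = {z:.2f}, p = {p_prop:.4g}")
```

Spearman's rank correlation is the natural choice here because Amboss difficulty is an ordinal rating, not an interval scale; with the real per-question data in place of the synthetic frame, the same three steps would reproduce the headline figures above.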