Keywords: ChatGPT; China; Chinese; Chinese National Medical Licensing Examination; LLM; LLMs; OpenAI; accuracy; answer; answers; artificial intelligence; chatbot; chatbots; conversational agent; conversational agents; exam; examination; examinations; exams; language model; language models; large language models; medical education; performance; response; responses; system role

MeSH: Humans; Licensure, Medical; China; Educational Measurement / methods / standards; Reproducibility of Results; Clinical Competence / standards

Source: DOI 10.2196/52784 (PDF via PubMed)

Abstract:
Background: With the increasing application of large language models such as ChatGPT in various industries, their potential in the medical domain, especially in standardized examinations, has become a focal point of research.
Objective: The aim of this study was to assess the clinical performance of ChatGPT, focusing on its accuracy and reliability on the Chinese National Medical Licensing Examination (CNMLE).
Methods: The CNMLE 2022 question set, consisting of 500 single-answer multiple-choice questions, was reclassified into 15 medical subspecialties. Each question was tested 8 to 12 times in Chinese on the OpenAI platform from April 24 to May 15, 2023. Three key factors were considered: the GPT version (3.5 vs 4.0), the prompt's designation of a system role tailored to the medical subspecialty, and repetition for coherence. The passing accuracy threshold was set at 60%. χ2 tests and κ values were used to evaluate the model's accuracy and consistency.
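As a rough illustration of the testing protocol described in the Methods, the following Python sketch queries the OpenAI chat API with a subspecialty-tailored system role and repeats each question several times. The system-role wording, model identifiers, and helper function names are assumptions for illustration; the paper's exact prompts and answer-parsing rules are not given in this abstract.

```python
# Minimal sketch of the repeated-query protocol, under the assumptions above.
from collections import Counter

from openai import OpenAI  # official OpenAI Python SDK

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def ask_once(question: str, subspecialty: str, model: str = "gpt-4") -> str:
    """Send one CNMLE-style multiple-choice question with a subspecialty system role."""
    response = client.chat.completions.create(
        model=model,
        messages=[
            # Hypothetical system role tailored to the question's subspecialty
            {"role": "system",
             "content": f"You are a physician specializing in {subspecialty}. "
                        "Answer with the single best option letter (A-E)."},
            {"role": "user", "content": question},
        ],
    )
    return response.choices[0].message.content.strip()


def ask_repeatedly(question: str, subspecialty: str, repeats: int = 8) -> list[str]:
    """Repeat each question 8-12 times, as in the study, to assess coherence."""
    return [ask_once(question, subspecialty) for _ in range(repeats)]


def majority_answer(answers: list[str]) -> str:
    """Most frequent answer across repetitions; ties would need a tie-breaking rule."""
    return Counter(answers).most_common(1)[0][0]
```

Comparing runs with and without the `system` message would reproduce the study's system-role factor; the repetition count (8-12) matches the range reported in the Methods.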
Results: GPT-4.0 achieved a passing accuracy of 72.7%, significantly higher than that of GPT-3.5 (54%; P<.001). The variability rate of repeated responses from GPT-4.0 was lower than that of GPT-3.5 (9% vs 19.5%; P<.001). However, both models showed relatively good response coherence, with κ values of 0.778 and 0.610, respectively. System roles numerically increased accuracy for both GPT-4.0 (by 0.3%-3.7%) and GPT-3.5 (by 1.3%-4.5%) and reduced variability by 1.7% and 1.8%, respectively (P>.05). In subgroup analysis, ChatGPT achieved comparable accuracy across question types (P>.05). GPT-4.0 exceeded the accuracy threshold in 14 of 15 subspecialties on the first response, while GPT-3.5 did so in 7 of 15.
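The statistics reported above can be computed as sketched below: a χ2 test contrasting the two models' correct/incorrect counts, and Fleiss' κ measuring agreement across repeated responses per question. The counts and response arrays are illustrative placeholders, not the study's data; the actual data layout is not specified in the abstract.

```python
# Sketch of the evaluation statistics under the assumptions stated above.
import numpy as np
from scipy.stats import chi2_contingency
from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa

# Hypothetical counts of [correct, incorrect] out of 500 questions per model,
# chosen to roughly match the reported accuracies (~72.7% and ~54%).
gpt4_counts = [364, 136]
gpt35_counts = [270, 230]
chi2, p_value, _, _ = chi2_contingency([gpt4_counts, gpt35_counts])
print(f"chi2={chi2:.2f}, P={p_value:.4g}")

# Illustrative responses: one row per question, one column per repetition,
# entries are the option letters the model chose on each run.
responses = np.array([
    ["A", "A", "A", "B", "A", "A", "A", "A"],
    ["C", "C", "C", "C", "C", "C", "C", "C"],
    ["D", "E", "D", "D", "E", "D", "D", "D"],
])
table, _ = aggregate_raters(responses)     # questions x answer-category counts
print(f"kappa={fleiss_kappa(table):.3f}")  # agreement across repetitions
```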
Conclusions: GPT-4.0 passed the CNMLE and outperformed GPT-3.5 in key areas such as accuracy, consistency, and medical subspecialty expertise. Adding a system role enhanced the model's reliability and answer coherence, though not significantly. GPT-4.0 showed promising potential in medical education and clinical practice, meriting further study.