Keywords: Artificial intelligence; ChatGPT; Medical education; Medical examination; Natural language processing

MeSH: China; Data Accuracy; Artificial Intelligence; Educational Measurement; Licensure; Education, Nursing; Education, Pharmacy; Education, Medical

Source: DOI:10.1186/s12909-024-05125-7   PDF (PubMed)

Abstract:
BACKGROUND: Large language models like ChatGPT have revolutionized the field of natural language processing with their capability to comprehend and generate textual content, showing great potential to play a role in medical education. This study aimed to quantitatively evaluate and comprehensively analyze the performance of ChatGPT on three types of national medical examinations in China: the National Medical Licensing Examination (NMLE), the National Pharmacist Licensing Examination (NPLE), and the National Nurse Licensing Examination (NNLE).
METHODS: We collected questions from the Chinese NMLE, NPLE, and NNLE from 2017 to 2021. In the NMLE and NPLE, each exam consists of 4 units, while in the NNLE, each exam consists of 2 units. Questions containing figures, tables, or chemical structures were manually identified and excluded by clinicians. We applied a direct instruction strategy via multiple prompts to force ChatGPT to generate a clear answer, with the capability to distinguish between single-choice and multiple-choice questions.
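The direct instruction strategy described above could be sketched as a prompt builder that states the answer format explicitly and switches its instruction depending on whether the question is single-choice or multiple-choice. This is a minimal hypothetical sketch; the function name and exact prompt wording are assumptions, not the authors' actual prompts.

```python
def build_prompt(stem: str, options: dict[str, str], multiple: bool) -> str:
    """Assemble an exam question into a direct-instruction prompt.

    Hypothetical illustration of forcing a clear, parseable answer while
    distinguishing single-choice from multiple-choice questions.
    """
    # Render options as "A. ...", "B. ...", ... in label order.
    option_lines = "\n".join(f"{label}. {text}" for label, text in sorted(options.items()))
    if multiple:
        rule = "This is a multiple-choice question; reply with ALL correct option letters, e.g. 'ABD'."
    else:
        rule = "This is a single-choice question; reply with exactly ONE option letter, e.g. 'C'."
    return f"{stem}\n{option_lines}\n{rule}\nAnswer with option letters only, no explanation."
```

The resulting string would then be sent to the chat model, and the reply parsed by keeping only the option letters before scoring against the answer key.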
RESULTS: ChatGPT failed to reach the accuracy threshold of 0.6 in any of the three types of examinations over the five years. Specifically, in the NMLE, the highest recorded accuracy was 0.5467, attained in both 2018 and 2021. In the NPLE, the highest accuracy was 0.5599, in 2017. In the NNLE, the best result was in 2017, with an accuracy of 0.5897, which is also the highest accuracy in our entire evaluation. ChatGPT's performance showed no significant difference across units, but a significant difference across question types. ChatGPT performed well in a range of subject areas, including clinical epidemiology, human parasitology, and dermatology, as well as in various medical topics such as molecules, health management and prevention, and diagnosis and screening.
CONCLUSIONS: These results indicate that ChatGPT failed the NMLE, NPLE, and NNLE in China from 2017 to 2021, but they show the great potential of large language models in medical education. In the future, high-quality medical data will be required to improve performance.