Keywords: AI; ChatGPT; LLM; artificial intelligence; clinical information; clinical notes; free-text notes; generative artificial intelligence; generative pretrained transformer; history and physical examination; large language model; medical education; medical students; medicine; natural language processing; patients; standardized patients

MeSH: Humans; Students, Medical; Retrospective Studies; Education, Medical, Undergraduate / methods; Educational Measurement / methods; Language; Medical History Taking / methods, standards; Clinical Competence / standards; Male

Source: DOI: 10.2196/56342 (PubMed)

Abstract:
Background: Teaching medical students the skills required to acquire, interpret, apply, and communicate clinical information is an integral part of medical education. A crucial aspect of this process involves providing students with feedback on the quality of their free-text clinical notes.
Objective: The goal of this study was to assess the ability of ChatGPT 3.5, a large language model, to score medical students' free-text history and physical notes.
Methods: This was a single-institution, retrospective study. Standardized patients learned a prespecified clinical case and, acting as the patient, interacted with medical students. Each student wrote a free-text history and physical note of their interaction. The students' notes were scored independently by the standardized patients and by ChatGPT using a prespecified scoring rubric consisting of 85 case elements. The measure of accuracy was percent correct.
Results: The study population consisted of 168 first-year medical students, yielding a total of 14,280 scores. The ChatGPT incorrect scoring rate was 1.0%, and the standardized patient incorrect scoring rate was 7.2%; that is, the ChatGPT error rate was 86% lower than the standardized patient error rate. ChatGPT's mean number of incorrect scores, 12 (SD 11), was significantly lower than the standardized patients' mean of 85 (SD 74; P=.002).
Conclusions: ChatGPT demonstrated a significantly lower error rate than standardized patients. This is the first study to assess the ability of a generative pretrained transformer (GPT) program to score medical students' standardized patient-based free-text clinical notes. It is expected that, in the near future, large language models will provide real-time feedback to practicing physicians on their free-text notes. GPT artificial intelligence programs represent an important advance in medical education and medical practice.
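The Results figures above are internally consistent, and the arithmetic can be checked directly. A minimal sketch, using only the numbers reported in the abstract (168 students, 85 rubric elements, 1.0% vs 7.2% incorrect scoring rates); the variable names are illustrative, not from the study:

```python
# Verify the abstract's reported figures from its raw inputs.
students = 168
rubric_elements = 85
total_scores = students * rubric_elements  # each note scored on all 85 elements

chatgpt_error_rate = 0.010  # 1.0% of scores incorrect
sp_error_rate = 0.072       # 7.2% of scores incorrect (standardized patients)

# Relative reduction in error rate, ChatGPT vs. standardized patients
relative_reduction_pct = round((sp_error_rate - chatgpt_error_rate) / sp_error_rate * 100)

print(total_scores)           # 14280 total scores, as reported
print(relative_reduction_pct) # 86, i.e., an 86% lower error rate
```

This confirms that "86%" in the Results is the relative reduction in error rate, not an absolute error rate.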