Keywords: AI, ChatGPT, ENT, LLM, LLMs, NLP, OHNS, answers, artificial intelligence, chatbot, chatbots, clinical implementation, ear, examination, examinations, exams, language model, large language models, laryngology, machine learning, medical education, medical examination, medical licensing, natural language processing, nose, otolaryngology, otolaryngology/head and neck surgery, otology, patient safety, responses, safety, surgery, surgical, throat, wide range information

MeSH: Humans; Canada; Certification; Hallucinations; Otolaryngology; Surgeons

Source: DOI: 10.2196/49970 | PDF (PubMed)

Abstract:
BACKGROUND: ChatGPT is among the most popular large language models (LLMs), exhibiting proficiency in various standardized tests, including multiple-choice medical board examinations. However, its performance on otolaryngology-head and neck surgery (OHNS) certification examinations and open-ended medical board certification examinations has not been reported.
OBJECTIVE: We aimed to evaluate the performance of ChatGPT on OHNS board examinations and propose a novel method to assess an AI model's performance on open-ended medical board examination questions.
METHODS: Twenty-one open-ended questions were adopted from the Royal College of Physicians and Surgeons of Canada's sample examination to query ChatGPT on April 11, 2023, with and without prompts. A new model, named Concordance, Validity, Safety, Competency (CVSC), was developed to evaluate its performance.
RESULTS: In the open-ended question assessment, ChatGPT achieved a passing mark, averaging 75% across 3 trials, and demonstrated higher accuracy with prompts. The model demonstrated high concordance (92.06%) and satisfactory validity. While showing considerable consistency in regenerating answers, it often provided only partially correct responses. Notably, concerning features such as hallucinations and self-conflicting answers were observed.
CONCLUSIONS: ChatGPT achieved a passing score in the sample examination and demonstrated the potential to pass the OHNS certification examination of the Royal College of Physicians and Surgeons of Canada. Some concerns remain due to its hallucinations, which could pose risks to patient safety. Further adjustments are necessary to yield safer and more accurate answers for clinical implementation.
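The headline figures above (a 75% average across 3 trials and a 92.06% concordance rate) are simple aggregates over per-question judgments. A minimal sketch of that arithmetic, assuming illustrative per-question scores and concordance flags rather than the study's actual data, and hypothetical function names:

```python
# Hypothetical sketch of score aggregation in the spirit of the CVSC
# (Concordance, Validity, Safety, Competency) evaluation.
# All inputs below are illustrative, not the study's actual grading data.

def average_score(trial_scores: list[list[float]]) -> float:
    """Mean percentage score across repeated trials.

    Each inner list holds per-question scores in [0, 1] for one trial
    (e.g. 1.0 = fully correct, 0.5 = partially correct, 0.0 = incorrect).
    """
    per_trial = [sum(trial) / len(trial) for trial in trial_scores]
    return 100 * sum(per_trial) / len(per_trial)

def concordance_rate(flags: list[bool]) -> float:
    """Percentage of responses judged concordant (answering the question asked)."""
    return 100 * sum(flags) / len(flags)

# Example: two trials of two questions each -> trial means 100% and 50%,
# giving an overall average of 75%.
overall = average_score([[1.0, 1.0], [0.5, 0.5]])
```

A response can be concordant (it addresses the question) yet only partially correct, which is why concordance and the passing score are reported separately.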