Keywords: ChatGPT; GPT-4V; artificial intelligence; large language models (LLMs); natural language processing (NLP); otolaryngology; ENT (ear, nose, throat); head and neck; respiratory; examinations; images; answers; responses

MeSH: Humans; Artificial Intelligence; Japan; Otolaryngology; Rhinitis, Allergic; Certification

Source: DOI: 10.2196/57054; PDF (PubMed)

Abstract:
BACKGROUND: Artificial intelligence models can learn from medical literature and clinical cases and generate answers that rival human experts. However, challenges remain in the analysis of complex data containing images and diagrams.
OBJECTIVE: This study aims to assess the answering capabilities and accuracy of ChatGPT-4 Vision (GPT-4V) for a set of 100 questions, including image-based questions, from the 2023 otolaryngology board certification examination.
METHODS: Answers to 100 questions from the 2023 otolaryngology board certification examination, including image-based questions, were generated using GPT-4V. The accuracy rate was evaluated under different prompts, and the effects of image presence, the clinical area of the questions, and variations in answer content were examined.
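The abstract does not specify the interface used to query GPT-4V. As a minimal sketch only, an image-based examination question could be submitted to a GPT-4V-capable model through the OpenAI Python client as below; the model name, prompt wording, and file path are illustrative assumptions, not details from the study.

    # Sketch: submitting an image-based exam question to GPT-4V via the
    # OpenAI Python client (openai>=1.0). Model name, prompt text, and
    # file path are illustrative assumptions, not details from the study.
    import base64
    from openai import OpenAI

    client = OpenAI()  # reads OPENAI_API_KEY from the environment

    def ask_gpt4v(question_text: str, image_path: str) -> str:
        # Encode the question's figure as a base64 data URL.
        with open(image_path, "rb") as f:
            image_b64 = base64.b64encode(f.read()).decode("utf-8")

        response = client.chat.completions.create(
            model="gpt-4-vision-preview",  # a GPT-4V-capable model
            messages=[{
                "role": "user",
                "content": [
                    {"type": "text",
                     "text": "Answer this otolaryngology board examination "
                             "question. Choose one option and explain briefly.\n"
                             + question_text},
                    {"type": "image_url",
                     "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
                ],
            }],
            max_tokens=500,
        )
        return response.choices[0].message.content

    # Text-only input would simply omit the image_url entry from the content list.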
RESULTS: The accuracy rate for text-only input was, on average, 24.7% but improved to 47.3% with the addition of English translation and prompts (P<.001). The average nonresponse rate for text-only input was 46.3%; this decreased to 2.7% with the addition of English translation and prompts (P<.001). The accuracy rate was lower for image-based questions than for text-only questions across all types of input, with a relatively high nonresponse rate. General questions and questions from the fields of head and neck allergies and nasal allergies had relatively high accuracy rates, which increased with the addition of translation and prompts. In terms of content, questions related to anatomy had the highest accuracy rate, and for all content types, the addition of translation and prompts increased the accuracy rate. For image-based questions, the average correct answer rate with text-only input was 30.4%, and that with text-plus-image input was 41.3% (P=.02).
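The abstract reports P values for paired accuracy comparisons (e.g., 30.4% vs 41.3%, P=.02) without naming the statistical test used. Purely as an illustration, a McNemar test on per-question correct/incorrect outcomes under the two input conditions is one standard way such a paired comparison could be computed; the counts below are placeholders, not the study's data.

    # Illustrative only: a McNemar test for a paired accuracy comparison
    # (the abstract does not state which test the authors actually used).
    from statsmodels.stats.contingency_tables import mcnemar

    # 2x2 table of per-question outcomes under the two conditions:
    # rows = text-only (correct, incorrect);
    # cols = text-plus-image (correct, incorrect).
    # These counts are placeholders, not data from the study.
    table = [[10, 4],
             [9, 23]]

    result = mcnemar(table, exact=True)  # exact binomial test on discordant pairs
    print(f"statistic={result.statistic}, p-value={result.pvalue:.3f}")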
CONCLUSIONS: Examination of artificial intelligence's answering capabilities for the otolaryngology board certification examination improves our understanding of its potential and limitations in this field. Although improvement was noted with the addition of translation and prompts, the accuracy rate for image-based questions was lower than that for text-based questions, suggesting room for improvement in GPT-4V at this stage. Furthermore, text-plus-image input achieved a higher correct answer rate on image-based questions than text-only input. Our findings imply the usefulness and potential of GPT-4V in medicine; however, methods for its safe use need to be considered in the future.