Keywords: AI; ChatGPT; GPT-4; GPT-4V; LLM; NLP; answer; answers; artificial intelligence; chatbot; chatbots; conversational agent; conversational agents; exam; examination; examinations; exams; generative pretrained transformer; image; images; imaging; language model; language models; large language model; medical education; natural language processing; response; responses

MeSH: Japan; Language; Licensure; Medicine

Source: DOI:10.2196/54393; PDF (PubMed)

Abstract:
BACKGROUND: Previous research applying large language models (LLMs) to medicine was focused on text-based information. Recently, multimodal variants of LLMs acquired the capability of recognizing images.
OBJECTIVE: We aim to evaluate the image recognition capability of generative pretrained transformer (GPT)-4V, a recent multimodal LLM developed by OpenAI, in the medical field by testing how visual information affects its ability to answer questions from the 117th Japanese National Medical Licensing Examination.
METHODS: We focused on 108 questions that included 1 or more images as part of the question and presented GPT-4V with the same questions under two conditions: (1) with both the question text and associated images and (2) with the question text only. We then compared the difference in accuracy between the 2 conditions using the exact McNemar test (a minimal sketch of this test follows the abstract).
RESULTS: Among the 108 questions with images, GPT-4V's accuracy was 68% (73/108) when presented with images and 72% (78/108) when presented without images (P=.36). For the 2 question categories, clinical and general, the accuracies with and without images were 71% (70/98) versus 78% (76/98; P=.21) and 30% (3/10) versus 20% (2/10; P≥.99), respectively.
CONCLUSIONS: The additional information from the images did not significantly improve the performance of GPT-4V in the Japanese National Medical Licensing Examination.
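The Methods compare paired per-question outcomes (correct vs. incorrect under the two conditions) with the exact McNemar test. The snippet below is a minimal sketch of that test in Python using statsmodels; the 2x2 discordant split is a hypothetical placeholder chosen only so that the marginal totals match the reported accuracies (73/108 correct with images, 78/108 without), since the abstract does not report the paired breakdown.

```python
# Minimal sketch of the exact McNemar test on paired binary outcomes
# (each of the 108 questions answered correctly/incorrectly under two conditions).
# NOTE: the discordant counts below are hypothetical placeholders; only the
# marginal totals (73/108 correct with images, 78/108 without) come from the abstract.
from statsmodels.stats.contingency_tables import mcnemar

# 2x2 paired contingency table:
#   rows    = outcome with images    (correct, incorrect)
#   columns = outcome without images (correct, incorrect)
table = [
    [65, 8],   # correct with images:   65 also correct without images, 8 incorrect without
    [13, 22],  # incorrect with images: 13 correct without images, 22 incorrect without
]

# exact=True performs a binomial test on the discordant pairs (here 8 vs 13)
result = mcnemar(table, exact=True)
print(f"McNemar statistic = {result.statistic}, exact P = {result.pvalue:.2f}")
```

The exact variant tests only the discordant pairs with a binomial test, which is appropriate when those counts are small, as is typical for 108 paired items.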