  • Article type: Journal Article
    Background: The field of radiology relies on accurate interpretation of medical images for effective diagnosis and patient care. Recent advancements in artificial intelligence (AI) and natural language processing have sparked interest in exploring the potential of AI models in assisting radiologists. However, limited research has been conducted to assess the performance of AI models in radiology case interpretation, particularly in comparison to human experts.
    Objective: This study aimed to evaluate the performance of ChatGPT, Google Bard, and Bing in solving radiology case vignettes (Fellowship of the Royal College of Radiologists 2A [FRCR2A] examination-style questions) by comparing their responses to those provided by two radiology residents.
    Methods: A total of 120 multiple-choice questions based on radiology case vignettes were formulated according to the pattern of the FRCR2A examination. The questions were presented to ChatGPT, Google Bard, and Bing. Two residents took the examination with the same questions in 3 hours. The responses generated by the AI models were collected and compared to the answer key, and the explanations of the answers were rated by two radiologists. A cutoff of 60% was set as the passing score.
    Results: The two residents (63.33% and 57.5%) outperformed the three AI models: Bard (44.17%), Bing (53.33%), and ChatGPT (45%), but only one resident passed the examination. The response patterns among the five respondents were significantly different (p = 0.0117). In addition, the agreement among the generative AI models was significant (intraclass correlation coefficient [ICC] = 0.628), but there was no agreement between the residents (Kappa = -0.376). The explanations given by the generative AI models in support of their answers were accurate 44.72% of the time.
    Conclusion: Humans exhibited superior accuracy compared to the AI models, demonstrating a stronger comprehension of the subject matter. None of the three AI models included in the study achieved the minimum percentage needed to pass the FRCR2A examination. However, the generative AI models showed significant agreement in their answers, whereas the residents exhibited low agreement, highlighting a lack of consistency in the residents' responses.
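The statistics reported in this abstract (percentage scores against a 60% cutoff, a comparison of response patterns, and agreement measures such as Cohen's kappa) can be reproduced in outline with standard Python libraries. The sketch below is purely illustrative: the responses are randomly generated placeholders, and the use of a chi-square test for the response-pattern comparison is an assumption, since the abstract does not name the test used.

```python
# Minimal sketch (not the authors' code) of the kind of scoring and agreement
# analysis the abstract describes. All responses below are hypothetical, and the
# chi-square comparison of response patterns is an assumption about the method.
import numpy as np
from scipy.stats import chi2_contingency
from sklearn.metrics import cohen_kappa_score

rng = np.random.default_rng(0)
n_questions = 120
answer_key = rng.integers(0, 4, n_questions)  # correct option (0-3) per MCQ

# Hypothetical responses for the five respondents (two residents, three AI models).
respondents = {name: rng.integers(0, 4, n_questions)
               for name in ["resident_1", "resident_2", "Bard", "Bing", "ChatGPT"]}

# Percentage score per respondent, with the 60% pass cutoff.
for name, answers in respondents.items():
    pct = 100 * np.mean(answers == answer_key)
    print(f"{name}: {pct:.2f}% ({'pass' if pct >= 60 else 'fail'})")

# Compare correct/incorrect response patterns across the five respondents.
table = np.array([[np.sum(a == answer_key), np.sum(a != answer_key)]
                  for a in respondents.values()])
chi2, p, dof, _ = chi2_contingency(table)
print(f"chi-square = {chi2:.2f}, dof = {dof}, p = {p:.4f}")

# Agreement between the two residents (Cohen's kappa on the chosen options).
kappa = cohen_kappa_score(respondents["resident_1"], respondents["resident_2"])
print(f"residents' kappa = {kappa:.3f}")
```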

  • Article type: Journal Article
    Background: Large language models (LLMs) have emerged as powerful tools capable of processing and generating human-like text. These LLMs, such as ChatGPT (OpenAI Incorporated, Mission District, San Francisco, United States), Google Bard (Alphabet Inc., CA, US), and Microsoft Bing (Microsoft Corporation, WA, US), have been applied across various domains, demonstrating their potential to assist in solving complex tasks and improving information accessibility. However, their application in solving case vignettes in physiology has not been explored. This study aimed to assess the performance of three LLMs, namely ChatGPT (3.5; free research version), Google Bard (Experiment), and Microsoft Bing (Precise), in answering case vignettes in physiology.
    Methods: This cross-sectional study was conducted in July 2023. A total of 77 case vignettes in physiology were prepared by two physiologists and validated by two other content experts. These cases were presented to each LLM, and their responses were collected. Two physiologists independently rated the answers provided by the LLMs based on their accuracy. The ratings were measured on a scale from 0 to 4 according to the Structure of the Observed Learning Outcome (SOLO) taxonomy (pre-structural = 0, uni-structural = 1, multi-structural = 2, relational = 3, extended abstract = 4). The scores among the LLMs were compared by Friedman's test, and inter-observer agreement was checked by the intraclass correlation coefficient (ICC).
    Results: The overall scores for ChatGPT, Bing, and Bard across the 77 cases were 3.19 ± 0.3, 2.15 ± 0.6, and 2.91 ± 0.5, respectively (p < 0.0001). Hence, ChatGPT 3.5 (free version) obtained the highest score, Bing (Precise) had the lowest score, and Bard (Experiment) fell in between the two in terms of performance. The average ICC values for ChatGPT, Bing, and Bard were 0.858 (95% CI: 0.777 to 0.91, p < 0.0001), 0.975 (95% CI: 0.961 to 0.984, p < 0.0001), and 0.964 (95% CI: 0.944 to 0.977, p < 0.0001), respectively.
    Conclusion: ChatGPT outperformed Bard and Bing in answering case vignettes in physiology. Hence, students and teachers may consider choosing LLMs accordingly for case-based learning in physiology. Further exploration of their capabilities is needed before they are adopted in medical education and clinical decision support.
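For readers unfamiliar with the statistics named in the Methods, the following sketch shows how a Friedman test across the three related sets of scores and an ICC for inter-observer agreement could be computed with scipy and pingouin. The 0-to-4 ratings are hypothetical placeholders; this is not the authors' analysis code.

```python
# Minimal sketch (not the authors' analysis) of Friedman's test and the
# intraclass correlation coefficient (ICC) mentioned above. The 0-4 ratings
# are hypothetical; only the statistical calls are meant to be illustrative.
import numpy as np
import pandas as pd
import pingouin as pg
from scipy.stats import friedmanchisquare

rng = np.random.default_rng(1)
n_cases = 77

# Hypothetical per-case scores (0-4, SOLO-style) for each LLM.
scores = {"ChatGPT": rng.integers(2, 5, n_cases),
          "Bard":    rng.integers(2, 5, n_cases),
          "Bing":    rng.integers(1, 4, n_cases)}

# Friedman's test: the same 77 cases scored for three models (related samples).
stat, p = friedmanchisquare(scores["ChatGPT"], scores["Bard"], scores["Bing"])
print(f"Friedman chi-square = {stat:.2f}, p = {p:.4g}")

# Inter-observer agreement for one model: ICC over two raters' scores per case.
rater_1 = scores["ChatGPT"]
rater_2 = np.clip(rater_1 + rng.integers(-1, 2, n_cases), 0, 4)  # slightly noisy second rater
long_format = pd.DataFrame({
    "case":  np.tile(np.arange(n_cases), 2),
    "rater": np.repeat(["physiologist_1", "physiologist_2"], n_cases),
    "score": np.concatenate([rater_1, rater_2]),
})
icc = pg.intraclass_corr(data=long_format, targets="case", raters="rater", ratings="score")
print(icc[["Type", "ICC", "CI95%"]])
```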