Keywords: BingAI; ChatGPT; Gemini; artificial intelligence; large language models; retinopathy of prematurity

Source: DOI:10.3390/children11060750 (PDF via PubMed)

Abstract:
BACKGROUND: Large language models (LLMs) are increasingly used to provide medical information. Our aim was to evaluate how effectively LLMs such as ChatGPT-4, BingAI, and Gemini respond to patient inquiries about retinopathy of prematurity (ROP).
METHODS: Three ophthalmologists assessed the LLMs' answers to fifty real-life patient inquiries using a 5-point Likert scale. The models' responses were also evaluated for reliability with the DISCERN instrument and the EQIP framework, and for readability using the Flesch Reading Ease (FRE), Flesch-Kincaid Grade Level (FKGL), and Coleman-Liau Index.
RESULTS: ChatGPT-4 outperformed BingAI and Gemini, receiving the highest score of 5 points for 90% (45 out of 50) of responses and ratings of "agreed" or "strongly agreed" for 98% (49 out of 50). It led in accuracy and reliability, with DISCERN and EQIP scores of 63 and 72.2, respectively. BingAI followed with scores of 53 and 61.1, while Gemini had the best readability (FRE score of 39.1) but lower reliability scores. Statistically significant performance differences were observed, particularly in the screening, diagnosis, and treatment categories.
CONCLUSIONS: ChatGPT-4 excelled in providing detailed and reliable responses to ROP-related queries, although its texts were more complex. All models delivered generally accurate information as per DISCERN and EQIP assessments.
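
For readers unfamiliar with the three readability metrics named in the Methods, the following is a minimal sketch of their commonly published formulas. It is not the tooling used in the study, which the abstract does not describe. The function names and the choice to pass raw counts (words, sentences, syllables, letters) as arguments are assumptions made for illustration; counting syllables reliably needs a separate tool and is left out.

```python
# Illustrative only: standard published readability formulas.
# Function names and the count-based interface are hypothetical,
# not taken from the study.

def flesch_reading_ease(words: int, sentences: int, syllables: int) -> float:
    """Flesch Reading Ease (FRE): higher scores mean easier text."""
    return 206.835 - 1.015 * (words / sentences) - 84.6 * (syllables / words)


def flesch_kincaid_grade(words: int, sentences: int, syllables: int) -> float:
    """Flesch-Kincaid Grade Level (FKGL): approximate US school grade."""
    return 0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59


def coleman_liau_index(letters: int, words: int, sentences: int) -> float:
    """Coleman-Liau Index: grade level from letter and sentence density."""
    letters_per_100_words = letters / words * 100
    sentences_per_100_words = sentences / words * 100
    return 0.0588 * letters_per_100_words - 0.296 * sentences_per_100_words - 15.8


if __name__ == "__main__":
    # Made-up counts for a 100-word passage, for demonstration only.
    print(round(flesch_reading_ease(100, 5, 150), 3))
    print(round(flesch_kincaid_grade(100, 5, 150), 2))
    print(round(coleman_liau_index(450, 100, 5), 2))
```

On the FRE scale, scores in the 30 to 50 range are conventionally read as difficult, college-level text, which gives context for the reported best FRE of 39.1.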