Keywords: audiology; ChatGPT; clinical electrophysiology; consistency; hearing assessment; large language model; repeatability

Source: DOI: 10.7759/cureus.59857

Abstract:
BACKGROUND: ChatGPT has been tested in many disciplines, but only a few studies have involved hearing diagnosis and none physiology or audiology more generally. The consistency of the chatbot's responses when the same question is posed multiple times has also not been well investigated. This study aimed to assess the accuracy and repeatability of ChatGPT 3.5 and ChatGPT 4 on test questions concerning objective measures of hearing. Of particular interest was the short-term repeatability of the responses, tested here on four separate days spread over two weeks.
METHODS: We used 30 single-answer, multiple-choice exam questions from a one-year course on objective methods of testing hearing. The questions were posed five times to both ChatGPT 3.5 (the free version) and ChatGPT 4 (the paid version) on each of four days (two days in one week and two days in the following week). The accuracy of the responses was evaluated against an answer key. To evaluate the repeatability of the responses over time, percent agreement and Cohen's kappa were calculated.
RESULTS: The overall accuracy of ChatGPT 3.5 was 48-49%, while that of ChatGPT 4 was 65-69%. ChatGPT 3.5 consistently failed to reach the threshold of 50% correct responses. Within a single day, the percent agreement was 76-79% for ChatGPT 3.5 and 87-88% for ChatGPT 4 (Cohen's kappa 0.67-0.71 and 0.81-0.84, respectively). The percent agreement between responses from different days was 75-79% for ChatGPT 3.5 and 85-88% for ChatGPT 4 (Cohen's kappa 0.65-0.69 and 0.80-0.85, respectively).
CONCLUSIONS: ChatGPT 4 outperforms ChatGPT 3.5 in both accuracy and repeatability over time. However, the considerable variability of the responses casts doubt on possible professional applications of either version.
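The two repeatability measures reported in Methods are standard and easy to reproduce. The sketch below, in Python, computes percent agreement and Cohen's kappa for a pair of 30-question answer runs; the answer letters are made-up illustrative data (the paper's raw responses are not reproduced here), not the study's actual outputs.

```python
from collections import Counter

def percent_agreement(run_a, run_b):
    """Fraction of questions on which two response runs give the same answer."""
    assert len(run_a) == len(run_b)
    return sum(a == b for a, b in zip(run_a, run_b)) / len(run_a)

def cohens_kappa(run_a, run_b):
    """Cohen's kappa: observed agreement corrected for the agreement
    expected by chance from each run's marginal answer frequencies."""
    n = len(run_a)
    p_o = percent_agreement(run_a, run_b)
    freq_a, freq_b = Counter(run_a), Counter(run_b)
    # Chance agreement: for each answer option, the probability that
    # both runs pick it independently, summed over all options seen.
    p_e = sum((freq_a[k] / n) * (freq_b[k] / n)
              for k in set(run_a) | set(run_b))
    return (p_o - p_e) / (1 - p_e)

# Hypothetical answer letters for 30 questions asked on two different days.
day1 = list("ABCDABCDABCDABCDABCDABCDABCDAB")
day2 = list("ABCDABCDABCDABCDABCDABCDABDCBA")

print(f"percent agreement: {percent_agreement(day1, day2):.2f}")  # 0.87
print(f"Cohen's kappa:     {cohens_kappa(day1, day2):.2f}")
```

Note that kappa is always at or below percent agreement, since it discounts matches that could arise by chance; this is why the paper reports kappa values (0.65-0.85) noticeably lower than the corresponding agreement percentages (75-88%).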