OBJECTIVE: To examine the reliability of ChatGPT in evaluating the medical content quality of the most-watched YouTube videos on urological cancers.
METHODS: In March 2024, a playlist was created of the 20 most-watched YouTube videos for each type of urological cancer. The video texts were evaluated by ChatGPT and by a urology specialist using the DISCERN-5 and Global Quality Scale (GQS) questionnaires. The results were compared using the Kruskal-Wallis test.
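As an illustration of the comparison step, the following is a minimal sketch of a Kruskal-Wallis test across the four cancer types, written with the Python standard library only. The function names (`chi2_sf`, `kruskal_wallis`) and the score lists are hypothetical and are not taken from the study; the sketch assumes ordinal scores from 1 to 5 and uses the usual tie-corrected H statistic with a chi-square approximation.

```python
import math
from itertools import chain


def chi2_sf(x, df):
    """Upper-tail probability of a chi-square variable with integer df."""
    if x <= 0:
        return 1.0
    half = x / 2
    if df % 2 == 0:
        # Even df: finite Poisson-type sum.
        return math.exp(-half) * sum(
            half**k / math.factorial(k) for k in range(df // 2)
        )
    # Odd df: erfc term plus a finite sum of half-integer powers.
    tail = sum(
        half ** (k - 0.5) / math.gamma(k + 0.5)
        for k in range(1, (df - 1) // 2 + 1)
    )
    return math.erfc(math.sqrt(half)) + math.exp(-half) * tail


def kruskal_wallis(*groups):
    """Return (H, p) for the tie-corrected Kruskal-Wallis test."""
    pooled = sorted(chain.from_iterable(groups))
    n = len(pooled)
    ranks = {}      # value -> average rank of that value in the pooled sample
    tie_term = 0    # sum of t^3 - t over groups of tied values
    i = 0
    while i < n:
        j = i
        while j < n and pooled[j] == pooled[i]:
            j += 1
        t = j - i
        ranks[pooled[i]] = (i + 1 + j) / 2  # mean of ranks i+1 .. j
        tie_term += t**3 - t
        i = j
    correction = 1 - tie_term / (n**3 - n)
    if correction == 0:
        raise ValueError("all observations are identical")
    rank_sums = sum(sum(ranks[x] for x in g) ** 2 / len(g) for g in groups)
    h = (12 / (n * (n + 1)) * rank_sums - 3 * (n + 1)) / correction
    return h, chi2_sf(h, len(groups) - 1)


# Hypothetical DISCERN-5 scores (illustrative only, not the study data).
scores = {
    "prostate": [4, 4, 3, 5, 4, 3],
    "bladder": [3, 3, 3, 4, 3, 2],
    "renal": [3, 2, 4, 5, 3, 1],
    "testicular": [3, 3, 4, 3, 2, 3],
}
h_stat, p_value = kruskal_wallis(*scores.values())
print(f"H = {h_stat:.3f}, P = {p_value:.3f}")
```

A P value above the chosen threshold (commonly .05) would indicate no significant difference in score distributions between the groups. With samples as small as 20 videos per group and many tied scores, the chi-square approximation is only approximate.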
RESULTS: For the prostate, bladder, renal, and testicular cancer videos, respectively, the median [IQR] DISCERN-5 scores were 4 [1], 3 [0], 3 [2], and 3 [1] from the human evaluator (P = .11) and 3 [1.75], 3 [1], 3 [2], and 3 [0] from ChatGPT (P = .4). The corresponding GQS scores were 4 [1.75], 3 [0.75], 3.5 [2], and 3.5 [1] from the human evaluator (P = .12) and 4 [1], 3 [0.75], 3 [1], and 3.5 [1] from ChatGPT (P = .1), with no significant difference between the scores. The repeatability of the ChatGPT responses was similar across cancer types: 25% for prostate, 30% for bladder, 30% for renal, and 35% for testicular cancer (P = .92). No statistically significant difference was found between the DISCERN-5 and GQS scores given by the human evaluator and by ChatGPT for the prostate, bladder, renal, or testicular cancer videos (P > .05).
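To make the "median [IQR]" notation concrete (for example, 4 [1] denotes a median of 4 with an interquartile range of 1), here is a minimal sketch of how such a summary could be computed. The helper name `median_iqr` and the sample scores are hypothetical, and the quartile method (`inclusive`) is an assumption; the study does not state which method was used, and other methods can give slightly different IQR values.

```python
import statistics


def median_iqr(values):
    """Return (median, IQR) using inclusive quartiles (an assumed convention)."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return statistics.median(values), q3 - q1


# Hypothetical GQS scores for one cancer type (illustrative only).
gqs_scores = [3, 3, 3, 4, 4, 4, 4, 4, 5, 5]
med, iqr = median_iqr(gqs_scores)
print(f"{med} [{iqr}]")
```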
CONCLUSIONS: Although ChatGPT's ratings of the medical quality of video texts did not differ significantly from those of the human evaluator, its results should be interpreted with caution because their repeatability was low.