Keywords: ChatGPT; artificial intelligence; debate; deep-learning; machine-learning

MeSH: Urology/education; Humans; Internship and Residency; Artificial Intelligence; Reproducibility of Results

Source: DOI: 10.1089/end.2023.0413

Abstract:
Background/Aim: To evaluate the performance of Chat Generative Pre-trained Transformer (ChatGPT), a large language model trained by OpenAI, in the urologic field. Materials and Methods: This study comprised three main steps to evaluate the effectiveness of ChatGPT in urology. The first step involved 35 questions prepared by our institution's experts, each with at least 10 years of experience in their field. The responses of the ChatGPT versions were qualitatively compared with the responses of urology residents to the same questions. The second step assessed the reliability of the ChatGPT versions in answering current debate topics. The third step assessed the reliability of the ChatGPT versions in providing medical recommendations and directives in response to questions commonly asked by patients in the outpatient and inpatient clinics. Results: In the first step, version 4 answered 25 of the 35 questions correctly, while version 3.5 answered only 19 correctly (71.4% vs 54%). Residents in their last year of training in our clinic also provided a mean of 25 correct answers, and 4th-year residents provided a mean of 19.3 correct answers. The second step evaluated the responses of both versions to debate situations in urology, and both versions produced variable and inappropriate results. In the last step, both versions had a similar success rate in providing recommendations and guidance to patients based on expert ratings. Conclusion: The difference between the two versions on the 35 questions in the first step of the study was attributed to the improvement of ChatGPT's literature and data synthesis abilities. It may be a reasonable approach to use the ChatGPT versions to give quick and safe answers to questions from non-health care providers, but they should not be used as a diagnostic tool or to choose among different treatment modalities.