OBJECTIVE: Autoimmune liver diseases (AILDs) are rare and require precise evaluation, which is often challenging for medical providers. Chatbots are innovative solutions to assist healthcare professionals in clinical management. In our study, ten liver specialists systematically evaluated four chatbots to determine their utility as clinical decision support tools in the field of AILDs.
METHODS: We constructed a 56-question questionnaire focusing on AILD evaluation, diagnosis, and management of Autoimmune Hepatitis (AIH), Primary Biliary Cholangitis (PBC), and Primary Sclerosing Cholangitis (PSC). Four chatbots (ChatGPT 3.5, Claude, Microsoft Copilot, and Google Bard) were presented with the questions in their free tiers in December 2023. Responses underwent critical evaluation by ten liver specialists using a standardized 1 to 10 Likert scale. The analysis included mean scores, the number of highest-rated replies, and the identification of common shortcomings in chatbot performance.
RESULTS: Among the assessed chatbots, specialists rated Claude highest with a mean score of 7.37 (SD = 1.91), followed by ChatGPT (7.17, SD = 1.89), Microsoft Copilot (6.63, SD = 2.10), and Google Bard (6.52, SD = 2.27). Claude also excelled with 27 best-rated replies, outperforming ChatGPT (20), while Microsoft Copilot and Google Bard lagged with only 6 and 9, respectively. Common deficiencies included listing details rather than giving specific advice, limited dosing options, inaccuracies for pregnant patients, insufficient recent data, over-reliance on CT and MRI imaging, and inadequate discussion of off-label use and fibrates in PBC treatment. Notably, internet access for Microsoft Copilot and Google Bard did not enhance precision compared with the pre-trained models.
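The summary statistics above (per-chatbot mean and SD, plus the count of uniquely best-rated replies across questions) can be sketched as follows. This is a minimal illustration of the aggregation described in the methods, not the study's actual data or analysis code; the rating values and the tie-handling rule (crediting only a unique top score) are assumptions for demonstration.

```python
import statistics

def summarize(scores):
    """Return (mean, sample standard deviation) for a list of Likert ratings (1-10)."""
    return statistics.mean(scores), statistics.stdev(scores)

def best_rated_counts(per_question):
    """Count, per chatbot, the questions on which it received the single top score.

    per_question maps chatbot name -> list of per-question mean ratings,
    all lists in the same question order. Ties award no one (an assumption
    made here for illustration).
    """
    counts = {name: 0 for name in per_question}
    n_questions = len(next(iter(per_question.values())))
    for q in range(n_questions):
        scores = {name: vals[q] for name, vals in per_question.items()}
        top = max(scores.values())
        winners = [name for name, s in scores.items() if s == top]
        if len(winners) == 1:  # credit only a uniquely best reply
            counts[winners[0]] += 1
    return counts

# Illustrative ratings only -- not the study data.
mean, sd = summarize([8, 7, 9, 6, 7])
wins = best_rated_counts({"Claude": [9, 5], "ChatGPT": [7, 5]})
```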
CONCLUSIONS: Chatbots hold promise in AILD support, but our study underscores key areas for improvement. Refinement is needed in providing specific advice, improving accuracy, and delivering focused, up-to-date information. Addressing these shortcomings is essential for enhancing the utility of chatbots in AILD management, guiding future development, and ensuring their effectiveness as clinical decision support tools.