Keywords: Artificial intelligence; Bard; ChatGPT; Large language models; Lumbar spine fusion; Patient education

Source: DOI: 10.14245/ns.2448098.049

Abstract:
OBJECTIVE: In the digital age, patients turn to online sources for lumbar spine fusion information, necessitating a careful study of large language models (LLMs) like Chat Generative Pre-trained Transformer (ChatGPT) for patient education.
METHODS: Our study aims to assess the response quality of OpenAI's ChatGPT 3.5 and Google's Bard to patient questions on lumbar spine fusion surgery. We identified 10 critical questions from 158 frequently asked ones via Google search and presented them to both chatbots. Five blinded spine surgeons rated the responses on a 4-point scale from 'unsatisfactory' to 'excellent.' The clarity and professionalism of the answers were also evaluated using a 5-point Likert scale.
RESULTS: In our evaluation of 10 questions across ChatGPT 3.5 and Bard, 97% of responses were rated as excellent or satisfactory. Specifically, ChatGPT had 62% excellent and 32% minimally clarifying responses, with only 6% needing moderate or substantial clarification. Bard's responses were 66% excellent and 24% minimally clarifying, with 10% requiring more clarification. No significant difference was found in the overall rating distribution between the 2 models. Both struggled with 3 specific questions regarding surgical risks, success rates, and selection of surgical approaches (Q3, Q4, and Q5). Interrater reliability was low for both models (ChatGPT: κ = 0.041, p = 0.622; Bard: κ = -0.040, p = 0.601). While both scored well on understanding and empathy, Bard received marginally lower ratings in empathy and professionalism.
CONCLUSIONS: ChatGPT 3.5 and Bard effectively answered lumbar spine fusion FAQs, but further training and research are needed to solidify LLMs' role in medical education and healthcare communication.
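Note on the interrater statistic: the near-zero reliability values reported above (κ = 0.041 for ChatGPT, κ = -0.040 for Bard) are kappa coefficients for agreement among the 5 raters. As a minimal sketch of how such a multi-rater statistic is computed, the Python below implements Fleiss' kappa for a questions-by-categories count matrix; the choice of Fleiss' kappa (rather than another agreement measure) is an assumption based on the 5-rater design, and the rating counts shown are hypothetical, not the study's data.

    import numpy as np

    def fleiss_kappa(counts):
        # counts[i, j]: number of raters assigning question i to category j;
        # every question must be rated by the same number of raters.
        counts = np.asarray(counts, dtype=float)
        n_subjects = counts.shape[0]
        n_raters = counts[0].sum()
        # Overall share of ratings falling in each category.
        p_j = counts.sum(axis=0) / (n_subjects * n_raters)
        # Observed pairwise agreement for each question.
        P_i = ((counts**2).sum(axis=1) - n_raters) / (n_raters * (n_raters - 1))
        P_bar = P_i.mean()        # mean observed agreement
        P_e = (p_j**2).sum()      # agreement expected by chance
        return (P_bar - P_e) / (1.0 - P_e)

    # Hypothetical counts: 5 questions x 4 rating categories
    # ('excellent', 'minimally clarifying', 'moderate', 'substantial'),
    # each row summing to the 5 raters. NOT the study's data.
    ratings = np.array([
        [3, 2, 0, 0],
        [2, 2, 1, 0],
        [4, 0, 1, 0],
        [1, 2, 1, 1],
        [3, 1, 1, 0],
    ])
    print(f"Fleiss' kappa = {fleiss_kappa(ratings):.3f}")

A kappa near 0, as both models showed, means the surgeons' ratings agreed no more than chance would predict, which is why the abstract characterizes interrater reliability as low despite the high share of favorable ratings.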