Keywords: Artificial intelligence; Clinical guidelines; Degenerative spondylolisthesis; Large language models; Spine

Source: DOI:10.1007/s00586-024-08198-6

Abstract:
BACKGROUND: Clinical guidelines, developed in concordance with the literature, are often used to guide surgeons' clinical decision-making. Recent advances in large language models and artificial intelligence (AI) in the medical field carry exciting potential. OpenAI's generative AI model, known as ChatGPT, can quickly synthesize information and generate responses grounded in medical literature, which may prove to be a useful tool in clinical decision-making for spine care. The current literature has yet to investigate the ability of ChatGPT to assist clinical decision-making for degenerative spondylolisthesis.
OBJECTIVE: The study aimed to compare ChatGPT's concordance with the recommendations set forth by the North American Spine Society (NASS) Clinical Guideline for the Diagnosis and Treatment of Degenerative Spondylolisthesis and to assess ChatGPT's accuracy within the context of the most recent literature.
METHODS: ChatGPT-3.5 and 4.0 were prompted with questions from the NASS Clinical Guideline for the Diagnosis and Treatment of Degenerative Spondylolisthesis, and their recommendations were graded as "concordant" or "nonconcordant" relative to those put forth by NASS. A response was considered "concordant" when ChatGPT generated a recommendation that accurately reproduced all major points made in the NASS recommendation. Any response graded "nonconcordant" was further stratified into one of two subcategories, "insufficient" or "over-conclusive," to provide further insight into the grading rationale. Responses from GPT-3.5 and 4.0 were compared using Chi-squared tests.
RESULTS: ChatGPT-3.5 answered 13 of NASS's 28 total clinical questions in concordance with NASS's guidelines (46.4%). The categorical breakdown is as follows: Definitions and Natural History (1/1, 100%), Diagnosis and Imaging (1/4, 25%), Outcome Measures for Medical Intervention and Surgical Treatment (0/1, 0%), Medical and Interventional Treatment (4/6, 66.7%), Surgical Treatment (7/14, 50%), and Value of Spine Care (0/2, 0%). When NASS indicated there was sufficient evidence to offer a clear recommendation, ChatGPT-3.5 generated a concordant response 66.7% of the time (6/9). However, ChatGPT-3.5's concordance dropped to 36.8% on clinical questions for which NASS did not provide a clear recommendation (7/19). A further breakdown of ChatGPT-3.5's nonconcordance with the guidelines revealed that the vast majority of its inaccurate recommendations were "over-conclusive" (12/15, 80%) rather than "insufficient" (3/15, 20%). ChatGPT-4.0 answered 19 (67.9%) of the 28 total questions in concordance with NASS guidelines (P = 0.177). When NASS indicated there was sufficient evidence to offer a clear recommendation, ChatGPT-4.0 generated a concordant response 66.7% of the time (6/9). ChatGPT-4.0's concordance held up at 68.4% on clinical questions for which NASS did not provide a clear recommendation (13/19, P = 0.104).
CONCLUSIONS: This study sheds light on the duality of LLM applications in clinical settings: accuracy and utility in some contexts versus inaccuracy and risk in others. ChatGPT was concordant for most clinical questions for which NASS offered recommendations. However, for questions on which NASS did not offer best practices, ChatGPT generated answers that were either too general or inconsistent with the literature, and it even fabricated data and citations. Thus, clinicians should exercise extreme caution when consulting ChatGPT for clinical recommendations, taking care to verify its reliability against the recent literature.
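A note on the statistics: the abstract does not state which software or exact test variant produced the reported P values. As a quick check, a minimal Python sketch using scipy.stats.chi2_contingency (which applies the Yates continuity correction by default for 2x2 tables) on the reported counts reproduces both P = 0.177 and P = 0.104; variable names here are illustrative only.

from scipy.stats import chi2_contingency

# Overall concordance with NASS guidelines, as a 2x2 table of
# concordant vs. nonconcordant counts per model (28 questions each).
overall = [[13, 15],   # ChatGPT-3.5: 13 concordant, 15 nonconcordant
           [19, 9]]    # ChatGPT-4.0: 19 concordant, 9 nonconcordant
chi2, p, dof, expected = chi2_contingency(overall)
print(f"overall: chi2 = {chi2:.3f}, P = {p:.3f}")   # P = 0.177

# Subgroup: the 19 questions for which NASS offered no clear
# recommendation (ChatGPT-3.5: 7/19 concordant; ChatGPT-4.0: 13/19).
subgroup = [[7, 12],
            [13, 6]]
chi2, p, dof, expected = chi2_contingency(subgroup)
print(f"subgroup: chi2 = {chi2:.3f}, P = {p:.3f}")  # P = 0.104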