Keywords: ChatGPT; artificial intelligence; machine learning; musculoskeletal; natural language processing; orthopaedics

MeSH: Humans; Cross-Sectional Studies; Artificial Intelligence; Reproducibility of Results; Back Pain; Decision Making

Source: DOI:10.2519/jospt.2024.12151

Abstract:
OBJECTIVE: To compare the accuracy of an artificial intelligence chatbot to clinical practice guidelines (CPGs) recommendations for providing answers to complex clinical questions on lumbosacral radicular pain. DESIGN: Cross-sectional study. METHODS: We extracted recommendations from recent CPGs for diagnosing and treating lumbosacral radicular pain. Relative clinical questions were developed and queried to OpenAI's ChatGPT (GPT-3.5). We compared ChatGPT answers to CPGs recommendations by assessing the (1) internal consistency of ChatGPT answers by measuring the percentage of text wording similarity when a clinical question was posed 3 times, (2) reliability between 2 independent reviewers in grading ChatGPT answers, and (3) accuracy of ChatGPT answers compared to CPGs recommendations. Reliability was estimated using Fleiss' kappa (κ) coefficients, and accuracy by interobserver agreement as the frequency of the agreements among all judgments. RESULTS: We tested 9 clinical questions. The internal consistency of text ChatGPT answers was unacceptable across all 3 trials in all clinical questions (mean percentage of 49%, standard deviation of 15). Intrareliability (reviewer 1: κ = 0.90, standard error [SE] = 0.09; reviewer 2: κ = 0.90, SE = 0.10) and interreliability (κ = 0.85, SE = 0.15) between the 2 reviewers was "almost perfect." Accuracy between ChatGPT answers and CPGs recommendations was slight, demonstrating agreement in 33% of recommendations. CONCLUSION: ChatGPT performed poorly in internal consistency and accuracy of the indications generated compared to clinical practice guideline recommendations for lumbosacral radicular pain. J Orthop Sports Phys Ther 2024;54(3):1-7. Epub 29 January 2024. doi:10.2519/jospt.2024.12151.
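The abstract reports reviewer reliability using Fleiss' kappa (κ). As an illustrative aside (this is not the study's analysis code, and the example table below is made up), a minimal pure-Python sketch of how that statistic is computed from a subjects-by-categories count table:

```python
from typing import List


def fleiss_kappa(table: List[List[int]]) -> float:
    """Fleiss' kappa for a subjects-by-categories count table.

    table[i][j] = number of raters who assigned subject i to category j;
    every row must sum to the same number of raters n.
    """
    n_subjects = len(table)
    n_raters = sum(table[0])
    n_categories = len(table[0])
    # Proportion of all assignments that fall in each category.
    totals = [sum(row[j] for row in table) for j in range(n_categories)]
    p_j = [t / (n_subjects * n_raters) for t in totals]
    # Per-subject agreement: fraction of rater pairs that agree on that subject.
    P_i = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in table
    ]
    P_bar = sum(P_i) / n_subjects      # mean observed agreement
    P_e = sum(p * p for p in p_j)      # agreement expected by chance
    return (P_bar - P_e) / (1 - P_e)


# Hypothetical example: 2 reviewers grading 3 answers into 2 categories,
# agreeing on every answer -> kappa = 1.0 ("almost perfect" on Landis-Koch).
print(fleiss_kappa([[2, 0], [0, 2], [2, 0]]))  # 1.0
```

With two raters and two categories, complete disagreement on every subject (e.g. `[[1, 1], [1, 1]]`) yields κ = -1.0, the opposite extreme of the scale.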