Keywords: endocrine; resident education; surgical oncology; thyroid

Source: DOI:10.1177/00031348241269430

Abstract:
BACKGROUND: Artificial Intelligence (AI) has emerged as a promising tool in the delivery of health care. ChatGPT-4.0 (OpenAI, San Francisco, California) and Llama 2 (Meta, Menlo Park, CA) have each gained attention for their use in various medical applications.
OBJECTIVE: This study aims to evaluate and compare the effectiveness of ChatGPT-4.0 and Llama 2 in assisting with complex clinical decision making in the diagnosis and treatment of thyroid carcinoma.
METHODS: We reviewed the National Comprehensive Cancer Network® (NCCN) Clinical Practice Guidelines for the management of thyroid carcinoma and formulated up to 3 complex clinical questions for each decision-making page. ChatGPT-4.0 and Llama 2 were queried in a reproducible manner. The answers were scored on a 5-point Likert scale: 5) correct; 4) correct, with missing information requiring clarification; 3) correct, but unable to complete answer; 2) partially incorrect; 1) absolutely incorrect. Score frequencies were compared, and subgroup analysis was conducted on Correctness (defined as scores 1-2 vs 3-5) and Accuracy (scores 1-3 vs 4-5).
RESULTS: In total, 58 pages of the NCCN Guidelines® were analyzed, generating 167 unique questions. There was no statistically significant difference between ChatGPT-4.0 and Llama 2 in terms of overall score (Mann-Whitney U-test; Mean Rank = 160.53 vs 174.47, P = 0.123), Correctness (P = 0.177), or Accuracy (P = 0.891).
CONCLUSIONS: ChatGPT-4.0 and Llama 2 demonstrate a limited but substantial capacity to assist with complex clinical decision making relating to the management of thyroid carcinoma, with no significant difference in their effectiveness.
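The scoring and comparison described in METHODS can be illustrated with a minimal sketch. The score lists below are made-up placeholders, not the study's data, and the helper names (`rank_sums`, `mann_whitney_u`, `dichotomize`) are hypothetical. The abstract does not state which software was used or which test produced the subgroup P values, so only the rank computation behind the Mann-Whitney U-test and the two dichotomizations are shown.

```python
def rank_sums(a, b):
    """Pool both groups, assign midranks to ties, and return each group's rank sum."""
    pooled = sorted(a + b)
    rank_of = {}
    i = 0
    while i < len(pooled):
        j = i
        while j < len(pooled) and pooled[j] == pooled[i]:
            j += 1
        # Tied values occupy 1-based positions i+1..j; each gets the average position.
        rank_of[pooled[i]] = (i + 1 + j) / 2
        i = j
    return sum(rank_of[x] for x in a), sum(rank_of[x] for x in b)


def mann_whitney_u(a, b):
    """U statistic for group a: its rank sum minus the smallest possible rank sum."""
    r_a, _ = rank_sums(a, b)
    n_a = len(a)
    return r_a - n_a * (n_a + 1) / 2


def dichotomize(scores, cutoff):
    """Return (count of scores >= cutoff, count of scores < cutoff)."""
    high = sum(1 for s in scores if s >= cutoff)
    return high, len(scores) - high


if __name__ == "__main__":
    # Hypothetical Likert scores (1-5) for two models; NOT the study's data.
    model_a = [5, 4, 4, 3, 2, 5, 1, 4]
    model_b = [5, 5, 4, 3, 3, 4, 2, 5]

    r_a, r_b = rank_sums(model_a, model_b)
    print("mean ranks:", r_a / len(model_a), r_b / len(model_b))
    print("U (model_a):", mann_whitney_u(model_a, model_b))

    # Correctness: scores 3-5 vs 1-2. Accuracy: scores 4-5 vs 1-3.
    print("Correctness (a, b):", dichotomize(model_a, 3), dichotomize(model_b, 3))
    print("Accuracy (a, b):", dichotomize(model_a, 4), dichotomize(model_b, 4))
```

As a consistency check on the reported values: with 167 answers per model, the pooled sample has 334 ranks, so the two mean ranks must average (334 + 1) / 2 = 167.5, and (160.53 + 174.47) / 2 = 167.5.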