Keywords: ChatGPT; LI-RADS; Lung-RADS; O-RADS; Radiology Reporting and Data Systems; accuracy; categorization; chatbot; chatbots; large language model; recommendation; recommendations

Source: DOI:10.2196/55799   PDF (PubMed)

Abstract:
BACKGROUND: Large language models show promise for improving radiology workflows, but their performance on structured radiological tasks such as Reporting and Data Systems (RADS) categorization remains unexplored.
OBJECTIVE: This study aims to evaluate 3 large language model chatbots (Claude-2, GPT-3.5, and GPT-4) on assigning RADS categories to radiology reports and to assess the impact of different prompting strategies.
METHODS: This cross-sectional study compared 3 chatbots on 30 radiology reports (10 per RADS criterion) using a 3-level prompting strategy: zero-shot, few-shot, and guideline PDF-informed prompts. The cases were grounded in Liver Imaging Reporting & Data System (LI-RADS) version 2018, Lung CT (computed tomography) Screening Reporting & Data System (Lung-RADS) version 2022, and Ovarian-Adnexal Reporting & Data System (O-RADS) magnetic resonance imaging, and were meticulously prepared by board-certified radiologists. Each report underwent 6 assessments. Two blinded reviewers assessed the chatbots' responses for patient-level RADS categorization and overall ratings. Agreement across repetitions was assessed using Fleiss κ.
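The Methods repeat each report 6 times and quantify inter-run agreement with Fleiss κ. Below is a minimal, illustrative Python sketch of that agreement computation (not the authors' code), assuming string category labels per run and the statsmodels fleiss_kappa helper; the labels and values are hypothetical examples.

```python
# Illustrative sketch: Fleiss kappa over repeated chatbot runs.
# Requires: pip install statsmodels
from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa

# rows = reports, columns = the 6 repeated runs; values = assigned RADS category
# (hypothetical toy labels, not data from the study)
runs_per_report = [
    ["LR-4", "LR-4", "LR-4", "LR-3", "LR-4", "LR-4"],
    ["LR-5", "LR-5", "LR-5", "LR-5", "LR-5", "LR-5"],
    ["LR-3", "LR-4", "LR-3", "LR-3", "LR-2", "LR-3"],
]

# aggregate_raters turns rater labels into a subjects x categories count table,
# which is the input format fleiss_kappa expects.
table, _categories = aggregate_raters(runs_per_report)
print("Fleiss kappa across runs:", fleiss_kappa(table))
```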
RESULTS: Claude-2 achieved the highest accuracy in overall ratings with few-shot prompts and guideline PDFs (prompt-2), attaining 57% (17/30) average accuracy over 6 runs and 50% (15/30) accuracy with k-pass voting. Without prompt engineering, all chatbots performed poorly. The introduction of a structured exemplar prompt (prompt-1) increased the accuracy of overall ratings for all chatbots. Providing prompt-2 further improved Claude-2's performance, an enhancement not replicated by GPT-4. The interrun agreement was substantial for Claude-2 (κ=0.66 for overall rating and κ=0.69 for RADS categorization), fair for GPT-4 (κ=0.39 for both), and fair for GPT-3.5 (κ=0.21 for overall rating and κ=0.39 for RADS categorization). All chatbots showed significantly higher accuracy with LI-RADS version 2018 than with Lung-RADS version 2022 and O-RADS (P<.05); with prompt-2, Claude-2 achieved the highest overall rating accuracy of 75% (45/60) in LI-RADS version 2018.
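The Results report two accuracy summaries: the average over the 6 runs and accuracy after k-pass (majority) voting across those runs. A short illustrative sketch of how both could be computed, assuming one list of predicted categories per run; the toy data and function names are hypothetical, not the study's pipeline.

```python
# Illustrative sketch: average per-run accuracy vs. k-pass (majority-vote) accuracy.
from collections import Counter

def average_accuracy(run_predictions, gold):
    """Mean accuracy across repeated runs, each scored against the reference labels."""
    per_run = [
        sum(p == g for p, g in zip(run, gold)) / len(gold)
        for run in run_predictions
    ]
    return sum(per_run) / len(per_run)

def k_pass_accuracy(run_predictions, gold):
    """Accuracy after majority voting over the k runs for each report."""
    votes = list(zip(*run_predictions))  # one tuple of k answers per report
    majority = [Counter(v).most_common(1)[0][0] for v in votes]
    return sum(m == g for m, g in zip(majority, gold)) / len(gold)

# Hypothetical toy data: 3 runs over 4 reports, with reference categories `gold`
gold = ["LR-4", "LR-5", "LR-3", "LR-2"]
runs = [
    ["LR-4", "LR-5", "LR-3", "LR-3"],
    ["LR-4", "LR-5", "LR-2", "LR-2"],
    ["LR-3", "LR-5", "LR-3", "LR-2"],
]
print(average_accuracy(runs, gold), k_pass_accuracy(runs, gold))
```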
CONCLUSIONS: When equipped with structured prompts and guideline PDFs, Claude-2 demonstrated potential in assigning RADS categories to radiology cases according to established criteria such as LI-RADS version 2018. However, the current generation of chatbots lags in accurately categorizing cases based on more recent RADS criteria.