Background Structured radiology reports for pancreatic ductal adenocarcinoma (PDAC) improve surgical decision-making over free-text reports, but radiologist adoption is variable and resectability criteria are applied inconsistently.

Purpose To evaluate the performance of large language models (LLMs) in automatically creating PDAC synoptic reports from original reports and to explore performance in categorizing tumor resectability.

Materials and Methods In this institutional review board-approved retrospective study, 180 consecutive PDAC staging CT reports on patients referred to the authors' European Society for Medical Oncology-designated cancer center from January to December 2018 were included. Reports were reviewed by two radiologists to establish the reference standard for 14 key findings and National Comprehensive Cancer Network (NCCN) resectability category. GPT-3.5 and GPT-4 (accessed September 18-29, 2023) were prompted to create synoptic reports from original reports with the same 14 features, and their performance was evaluated (recall, precision, F1 score). To categorize resectability, three prompting strategies (default knowledge, in-context knowledge, chain-of-thought) were used for both LLMs. Hepatopancreaticobiliary surgeons reviewed original and artificial intelligence (AI)-generated reports to determine resectability, with accuracy and review time compared. The McNemar test, t test, Wilcoxon signed-rank test, and mixed effects logistic regression models were used where appropriate.

Results GPT-4 outperformed GPT-3.5 in the creation of synoptic reports (F1 score: 0.997 vs 0.967, respectively). Compared with GPT-3.5, GPT-4 achieved equal or higher F1 scores for all 14 extracted features. GPT-4 had higher precision than GPT-3.5 for extracting superior mesenteric artery involvement (100% vs 88.8%, respectively). For categorizing resectability, GPT-4 outperformed GPT-3.5 for each prompting strategy. For GPT-4, chain-of-thought prompting was most accurate, outperforming in-context knowledge prompting (92% vs 83%, respectively; P = .002), which outperformed the default knowledge strategy (83% vs 67%, P < .001). Surgeons were more accurate in categorizing resectability using AI-generated reports than original reports (83% vs 76%, respectively; P = .03), while spending less time on each report (58%; 95% CI: 0.53, 0.62).

Conclusion GPT-4 created near-perfect PDAC synoptic reports from original reports. GPT-4 with chain-of-thought achieved high accuracy in categorizing resectability. Surgeons were more accurate and efficient using AI-generated reports.

© RSNA, 2024. Supplemental material is available for this article. See also the editorial by Chang in this issue.
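The per-feature extraction metrics used above (recall, precision, F1) can be sketched as follows. This is a minimal illustration for a single binary finding (e.g., superior mesenteric artery involvement); the report labels are hypothetical examples, not data from the study.

```python
def prf1(reference, predicted):
    """Precision, recall, and F1 for one binary finding across a set of reports.

    reference: radiologist reference-standard labels (1 = finding present)
    predicted: labels extracted by the LLM from the original report
    """
    tp = sum(r == 1 and p == 1 for r, p in zip(reference, predicted))
    fp = sum(r == 0 and p == 1 for r, p in zip(reference, predicted))
    fn = sum(r == 1 and p == 0 for r, p in zip(reference, predicted))
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return precision, recall, f1

# Hypothetical labels: one misclassification in each direction
ref  = [1, 0, 1, 1, 0, 0, 1, 0]
pred = [1, 0, 1, 0, 0, 1, 1, 0]
p, r, f = prf1(ref, pred)  # 0.75, 0.75, 0.75
```

In the study this computation would be repeated for each of the 14 key findings, which is how a model can be compared feature by feature (as in the GPT-4 vs GPT-3.5 comparison above).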