BACKGROUND: Artificial intelligence (AI) systems can potentially aid the diagnostic pathway of prostate cancer by alleviating the increasing workload, preventing overdiagnosis, and reducing the dependence on experienced radiologists. We aimed to investigate the performance of AI systems at detecting clinically significant prostate cancer on MRI in comparison with radiologists using the Prostate Imaging-Reporting and Data System version 2.1 (PI-RADS 2.1) and the standard of care in multidisciplinary routine practice at scale.
METHODS: In this international, paired, non-inferiority, confirmatory study, we trained and externally validated an AI system (developed within an international consortium) for detecting Gleason grade group 2 or greater cancers using a retrospective cohort of 10 207 MRI examinations from 9129 patients. Of these examinations, 9207 cases from three centres (11 sites) based in the Netherlands were used for training and tuning, and 1000 cases from four centres (12 sites) based in the Netherlands and Norway were used for testing. In parallel, we facilitated a multireader, multicase observer study with 62 radiologists (45 centres in 20 countries; median 7 [IQR 5-10] years of experience in reading prostate MRI) using PI-RADS (2.1) on 400 paired MRI examinations from the testing cohort. Primary endpoints were the sensitivity, specificity, and the area under the receiver operating characteristic curve (AUROC) of the AI system in comparison with that of all readers using PI-RADS (2.1) and in comparison with that of the historical radiology readings made during multidisciplinary routine practice (ie, the standard of care with the aid of patient history and peer consultation). Histopathology and at least 3 years (median 5 [IQR 4-6] years) of follow-up were used to establish the reference standard. The statistical analysis plan was prespecified with a primary hypothesis of non-inferiority (considering a margin of 0·05) and a secondary hypothesis of superiority towards the AI system, if non-inferiority was confirmed. This study was registered at ClinicalTrials.gov, NCT05489341.
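The hierarchical testing described in the methods (non-inferiority first, superiority only if non-inferiority holds) can be illustrated with a minimal Python sketch. It assumes a point estimate of the performance difference (AI minus comparator) and its standard error are already available. The function name `wald_noninferiority`, its parameters, and the example inputs are hypothetical and are not taken from the study's analysis code, which would also need to account for the paired multireader, multicase design.

```python
from statistics import NormalDist


def wald_noninferiority(diff, se, margin=0.05, alpha=0.05):
    """Two-sided Wald CI for a difference, checked against a non-inferiority margin.

    diff   : estimated difference (AI minus comparator), e.g. in AUROC or specificity
    se     : standard error of that difference (assumed to be given)
    margin : non-inferiority margin (0.05 mirrors the margin stated in the text)
    """
    z = NormalDist().inv_cdf(1 - alpha / 2)  # about 1.96 for alpha = 0.05
    lower = diff - z * se
    upper = diff + z * se
    # Non-inferior if the whole interval lies above -margin.
    non_inferior = lower > -margin
    # Superiority is tested only after non-inferiority: interval must exclude 0.
    superior = non_inferior and lower > 0
    return lower, upper, non_inferior, superior


# Hypothetical inputs, for illustration only (not study data).
print(wald_noninferiority(diff=0.05, se=0.015))
print(wald_noninferiority(diff=-0.03, se=0.02))
```

In this sketch the decision depends only on where the lower confidence bound falls relative to the negative margin and to zero, which is why the results report the lower boundary of the interval rather than the interval's width.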
RESULTS: Of the 10 207 examinations included from Jan 1, 2012, through Dec 31, 2021, 2440 cases had histologically confirmed Gleason grade group 2 or greater prostate cancer. In the subset of 400 testing cases in which the AI system was compared with the radiologists participating in the reader study, the AI system showed a statistically superior and non-inferior AUROC of 0·91 (95% CI 0·87-0·94; p<0·0001), in comparison to the pool of 62 radiologists with an AUROC of 0·86 (0·83-0·89), with a lower boundary of the two-sided 95% Wald CI for the difference in AUROC of 0·02. At the mean PI-RADS 3 or greater operating point of all readers, the AI system detected 6·8% more cases with Gleason grade group 2 or greater cancers at the same specificity (57·7%, 95% CI 51·6-63·3), or 50·4% fewer false-positive results and 20·0% fewer cases with Gleason grade group 1 cancers at the same sensitivity (89·4%, 95% CI 85·3-92·9). In all 1000 testing cases where the AI system was compared with the radiology readings made during multidisciplinary practice, non-inferiority was not confirmed, as the AI system showed lower specificity (68·9% [95% CI 65·3-72·4] vs 69·0% [65·5-72·5]) at the same sensitivity (96·1%, 94·0-98·2) as the PI-RADS 3 or greater operating point. The lower boundary of the two-sided 95% Wald CI for the difference in specificity (-0·04) was greater than the non-inferiority margin (-0·05) and a p value below the significance threshold was reached (p<0·001).
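Comparing a continuous AI score with a fixed reader operating point "at the same sensitivity" amounts to choosing the AI threshold that reaches the reference sensitivity and then reading off the specificity. The Python sketch below shows one simple way this could be done. The function name `specificity_at_sensitivity` and the toy scores and labels are hypothetical and serve only as illustration; they do not reproduce the study's procedure or data.

```python
def specificity_at_sensitivity(scores, labels, target_sensitivity):
    """Pick the highest threshold whose sensitivity reaches the target.

    scores : per-case AI scores (higher means more suspicious)
    labels : 1 for clinically significant cancer, 0 otherwise
    Returns (threshold, sensitivity, specificity) at that operating point.
    """
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative case")

    # Scan candidate thresholds from strictest to most lenient.
    for threshold in sorted(set(scores), reverse=True):
        sensitivity = sum(s >= threshold for s in positives) / len(positives)
        if sensitivity >= target_sensitivity:
            specificity = sum(s < threshold for s in negatives) / len(negatives)
            return threshold, sensitivity, specificity
    raise ValueError("target sensitivity cannot be reached")


# Toy data, for illustration only (not study data).
toy_scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1]
toy_labels = [1, 1, 0, 1, 0, 1, 0, 0]
print(specificity_at_sensitivity(toy_scores, toy_labels, target_sensitivity=0.75))
# → (0.6, 0.75, 0.75)
```

The mirror-image comparison "at the same specificity" follows the same pattern with the roles of positive and negative cases exchanged.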
CONCLUSIONS: An AI system was superior to radiologists using PI-RADS (2.1), on average, at detecting clinically significant prostate cancer and comparable to the standard of care. Such a system shows the potential to be a supportive tool within a primary diagnostic setting, with several associated benefits for patients and radiologists. Prospective validation is needed to test clinical applicability of this system.
FUNDING: Health~Holland and EU Horizon 2020.