OBJECTIVE: Studies investigating the application of Artificial Intelligence (AI) in the field of radiotherapy exhibit substantial variation in quality. The goal of this study was to assess the degree of transparency and bias when scoring articles, with a specific focus on AI-based segmentation and treatment planning, using modified PROBAST and TRIPOD checklists, in order to provide recommendations for future guideline developers and reviewers.
METHODS: The TRIPOD and PROBAST checklist items were discussed and modified using a Delphi process. After consensus was reached, 2 groups of 3 co-authors each scored 2 articles to evaluate usability and to further optimize the adapted checklists. Finally, 10 articles were scored by all co-authors. Fleiss' kappa was calculated to assess the reliability of agreement between observers.
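As an illustration of the agreement analysis described above, the following is a minimal sketch of computing Fleiss' kappa for a single checklist item using statsmodels. The ratings matrix and the category coding are hypothetical and are not data from this study.

```python
# Minimal sketch: Fleiss' kappa for one checklist item, scored by several
# observers across several articles. All values below are illustrative.
import numpy as np
from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa

# Hypothetical ratings: rows = 10 articles (subjects), columns = 6 observers.
# Assumed category coding: 0 = "no", 1 = "yes", 2 = "unclear".
ratings = np.array([
    [1, 1, 1, 0, 1, 1],
    [0, 0, 1, 0, 0, 2],
    [1, 1, 1, 1, 1, 1],
    [2, 2, 1, 2, 0, 2],
    [0, 0, 0, 0, 1, 0],
    [1, 2, 1, 1, 1, 0],
    [0, 1, 0, 0, 0, 0],
    [1, 1, 2, 1, 1, 1],
    [2, 0, 2, 2, 2, 1],
    [1, 1, 1, 1, 0, 1],
])

# aggregate_raters converts the raw ratings into a subjects x categories
# count table, which is the input format fleiss_kappa expects.
table, _ = aggregate_raters(ratings)
kappa = fleiss_kappa(table, method="fleiss")
print(f"Fleiss' kappa: {kappa:.3f}")
```

Note that fleiss_kappa returns only the kappa coefficient; the significance testing reported in the results would require an additional variance estimate for the statistic.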
RESULTS: Three of the 37 TRIPOD items and 5 of the 32 PROBAST items were deemed irrelevant. General terminology in the items (e.g., multivariable prediction model, predictors) was modified to align with AI-specific terms. After the first scoring round, further improvements to the items were formulated, e.g., by preventing the use of sub-questions or subjective words and by adding clarifications on how to score an item. When the final consensus list was used to score the 10 articles, only 2 of the 61 items resulted in a statistically significant kappa of 0.4 or more, demonstrating substantial agreement. For 41 items, no statistically significant kappa was obtained, indicating that the level of agreement among the observers was attributable to chance alone.
CONCLUSIONS: Our study showed low reliability scores with the adapted TRIPOD and PROBAST checklists. Although such checklists have shown great value during model development and reporting, our findings raise concerns about their applicability for objectively scoring scientific articles on AI applications. When developing or revising guidelines, it is essential to consider their applicability to score articles without introducing bias.