Keywords: Computer-assisted diagnosis; Deep learning; Machine learning; Prevalence; X-rays

Source: DOI:10.1007/s00330-024-10834-0

Abstract:
OBJECTIVE: This work aims to assess standard evaluation practices used by the research community for evaluating medical imaging classifiers, with a specific focus on the implications of class imbalance. The analysis is performed on chest X-rays as a case study and encompasses a comprehensive model performance definition, considering both discriminative capabilities and model calibration.
METHODS: We conduct a concise literature review to examine prevailing scientific practices used when evaluating X-ray classifiers. Then, we perform a systematic experiment on two major chest X-ray datasets to showcase a didactic example of the behavior of several performance metrics under different class ratios and highlight how widely adopted metrics can conceal performance in the minority class.
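To make the class-ratio effect concrete, the following is a minimal sketch on synthetic scores, not the authors' experiment. The Gaussian score distributions, the sample sizes, and the helper names `auc_roc`, `average_precision`, and `evaluate` are all assumptions chosen for illustration. The same scoring model is evaluated twice, once with balanced classes and once after subsampling the positive class.

```python
import random


def auc_roc(labels, scores):
    """Fraction of (positive, negative) pairs in which the positive scores higher.

    Assumes continuous scores with no ties.
    """
    neg_seen = 0
    wins = 0
    for _, y in sorted(zip(scores, labels)):
        if y == 0:
            neg_seen += 1
        else:
            wins += neg_seen  # negatives ranked below this positive
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    return wins / (n_pos * n_neg)


def average_precision(labels, scores):
    """Mean of the precision values measured at the rank of each positive case."""
    tp = 0
    total = 0.0
    ranked = sorted(zip(scores, labels), reverse=True)
    for rank, (_, y) in enumerate(ranked, start=1):
        if y == 1:
            tp += 1
            total += tp / rank
    return total / tp


# Synthetic scores (assumption): negatives ~ N(0, 1), positives ~ N(1.5, 1).
rng = random.Random(0)
neg_scores = [rng.gauss(0.0, 1.0) for _ in range(2000)]
pos_scores = [rng.gauss(1.5, 1.0) for _ in range(2000)]


def evaluate(n_pos):
    """Evaluate the same scores after keeping only the first n_pos positives."""
    scores = neg_scores + pos_scores[:n_pos]
    labels = [0] * len(neg_scores) + [1] * n_pos
    prevalence = n_pos / len(labels)
    return {
        "prevalence": prevalence,
        # Accuracy of a trivial classifier that always predicts "negative".
        "accuracy_all_negative": 1.0 - prevalence,
        "auc_roc": auc_roc(labels, scores),
        "auc_pr": average_precision(labels, scores),
    }


balanced = evaluate(2000)  # 50% prevalence
rare = evaluate(100)       # about 4.8% prevalence

for name, result in (("balanced", balanced), ("rare", rare)):
    print(name, {k: round(v, 3) for k, v in result.items()})
```

In this sketch, AUC-ROC stays roughly stable when positives become rare, the always-negative baseline reaches high accuracy, and the precision-recall summary drops sharply, which is the kind of minority-class degradation that majority-dominated metrics can hide.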
RESULTS: Our literature study confirms that: (1) even when dealing with highly imbalanced datasets, the community tends to use metrics that are dominated by the majority class; and (2) it is still uncommon to include calibration studies for chest X-ray classifiers, despite their importance in the context of healthcare. Moreover, our systematic experiments confirm that current evaluation practices may not reflect model performance in real clinical scenarios and suggest complementary metrics that better reflect system performance in such scenarios.
CONCLUSIONS: Our analysis underscores the need for enhanced evaluation practices, particularly in the context of class-imbalanced chest X-ray classifiers. We recommend including complementary metrics such as the area under the precision-recall curve (AUC-PR), adjusted AUC-PR, and the balanced Brier score, which reflect both discrimination and calibration performance, to offer a more accurate depiction of system performance in real clinical scenarios.
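The abstract does not define the balanced Brier score, so the sketch below rests on an assumption: it takes the balanced score to be the mean of the Brier scores computed separately on positive and negative cases. The function names `brier` and `balanced_brier` and the toy data are hypothetical. The example scores a model that outputs the prevalence (0.05) for every image.

```python
def brier(labels, probs):
    """Standard Brier score: mean squared error between probability and label."""
    return sum((p - y) ** 2 for y, p in zip(labels, probs)) / len(labels)


def balanced_brier(labels, probs):
    """Assumed definition: average of the per-class Brier scores."""
    pos = [(y, p) for y, p in zip(labels, probs) if y == 1]
    neg = [(y, p) for y, p in zip(labels, probs) if y == 0]
    brier_pos = sum((p - y) ** 2 for y, p in pos) / len(pos)
    brier_neg = sum((p - y) ** 2 for y, p in neg) / len(neg)
    return 0.5 * (brier_pos + brier_neg)


# Toy dataset (assumption): 5 positives among 100 cases, constant prediction 0.05.
labels = [1] * 5 + [0] * 95
probs = [0.05] * 100

overall = brier(labels, probs)
balanced_score = balanced_brier(labels, probs)
print(f"Brier: {overall:.4f}  balanced Brier: {balanced_score:.4f}")
```

Under this assumed definition, the constant predictor looks well calibrated on the standard Brier score because the 95 negatives dominate the average, while the per-class average exposes the large error on the positive cases.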
CLINICAL RELEVANCE STATEMENT: This study underscores the critical need for refined evaluation metrics in medical imaging classifiers, emphasizing that prevalent metrics may mask poor performance in minority classes, potentially impacting clinical diagnoses and healthcare outcomes.
KEY POINTS: Common scientific practices in papers dealing with X-ray computer-assisted diagnosis (CAD) systems may be misleading. We highlight limitations in the reporting of evaluation metrics for X-ray CAD systems in highly imbalanced scenarios. We propose adopting alternative metrics based on experimental evaluation on large-scale datasets.