Keywords: Deep learning; Medical image segmentation; Performance comparisons; Random seeds; Randomness

MeSH: Deep Learning; Humans; Algorithms; Brain Neoplasms / diagnostic imaging; Image Processing, Computer-Assisted / methods; Hippocampus / diagnostic imaging

Source: DOI: 10.1016/j.compbiomed.2024.108944

Abstract:
BACKGROUND: A single learning algorithm can produce deep learning-based image segmentation models that vary in performance purely due to random effects during training. This study assessed the effect of these random performance fluctuations on the reliability of standard methods of comparing segmentation models.
METHODS: The influence of random effects during training was assessed by running a single learning algorithm (nnU-Net) with 50 different random seeds for three multiclass 3D medical image segmentation problems: brain tumour, hippocampus, and cardiac segmentation. Recent literature was sampled to find the most common methods for estimating and comparing the performance of deep learning segmentation models. Based on this, segmentation performance was assessed using both hold-out validation and 5-fold cross-validation, and the statistical significance of performance differences was measured using the paired t-test and the Wilcoxon signed-rank test on Dice scores.
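The comparison described above can be sketched in a few lines. This is an illustrative example, not the paper's code: the Dice scores are synthetic, and the set size and score distributions are assumptions; only the two named tests (paired t-test and Wilcoxon signed-rank test, here via `scipy.stats`) come from the abstract.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_cases = 30  # hypothetical hold-out set size (not stated in the abstract)

# Hypothetical per-case Dice scores for two models trained from the same
# learning algorithm with different random seeds.
dice_seed_a = np.clip(rng.normal(0.85, 0.05, n_cases), 0.0, 1.0)
dice_seed_b = np.clip(dice_seed_a + rng.normal(0.01, 0.02, n_cases), 0.0, 1.0)

# Paired t-test: parametric, assumes roughly normal paired differences.
t_stat, t_p = stats.ttest_rel(dice_seed_a, dice_seed_b)

# Wilcoxon signed-rank test: non-parametric alternative on the same pairs.
w_stat, w_p = stats.wilcoxon(dice_seed_a, dice_seed_b)

print(f"paired t-test p={t_p:.4f}, Wilcoxon p={w_p:.4f}")
```

Both tests are paired because the two models are evaluated on the same cases, so per-case scores are matched rather than independent samples.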
RESULTS: For the different segmentation problems, the seed producing the highest mean Dice score statistically significantly outperformed between 0% and 76% of the remaining seeds when estimating performance using hold-out validation, and between 10% and 38% when estimating performance using 5-fold cross-validation.
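The "fraction of seeds outperformed" metric above can be reproduced on synthetic data. This sketch uses the study's 50 seeds but a hypothetical case count, invented Dice distributions, and the paired t-test at α = 0.05 as a stand-in for the paper's full protocol.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
n_seeds, n_cases = 50, 30  # 50 seeds as in the study; case count assumed

# Hypothetical per-seed, per-case Dice scores with small random fluctuations
# standing in for seed-to-seed training noise.
dice = np.clip(rng.normal(0.85, 0.02, (n_seeds, n_cases)), 0.0, 1.0)

best = int(dice.mean(axis=1).argmax())  # seed with the highest mean Dice
significant = 0
for s in range(n_seeds):
    if s == best:
        continue
    _, p = stats.ttest_rel(dice[best], dice[s])
    # Count seeds the best seed beats with a "significant" p-value.
    if p < 0.05 and dice[best].mean() > dice[s].mean():
        significant += 1

pct = 100.0 * significant / (n_seeds - 1)
print(f"best seed significantly outperforms {pct:.0f}% of the other seeds")
```

Even with identical training distributions, selecting the best of 50 seeds and testing it against the rest inflates the apparent number of "significant" wins, which is the effect the study quantifies.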
CONCLUSIONS: Random effects during training can cause high rates of statistically significant performance differences between segmentation models from the same learning algorithm. Whilst statistical testing is widely used in contemporary literature, our results indicate that a statistically significant difference in segmentation performance is a weak and unreliable indicator of a true performance difference between two learning algorithms.