Keywords: monocular depth estimation; normal estimate; self-distillation; self-supervised learning

Source: DOI: 10.3390/s24134090

Abstract:
Self-supervised monocular depth estimation can achieve excellent performance in static environments thanks to the multi-view consistency assumption used during training. However, it is hard to maintain depth consistency in dynamic scenes because of the occlusions caused by moving objects. For this reason, we propose a self-supervised self-distillation method for monocular depth estimation (SS-MDE) in dynamic scenes, in which a deep network with a multi-scale decoder and a lightweight pose network are designed to predict depth in a self-supervised manner from the disparity, the motion information, and the association between two adjacent frames in the image sequence. Meanwhile, to improve the depth estimation accuracy in static areas, pseudo-depth images generated by the LeReS network are used to provide pseudo-supervision, enhancing depth refinement in those regions. Furthermore, a forgetting factor is leveraged to alleviate the dependency on this pseudo-supervision. In addition, a teacher model is introduced to generate depth prior information, and a multi-view mask filter module is designed to perform feature extraction and noise filtering. This enables the student model to better learn the depth structure of dynamic scenes, enhancing the generalization and robustness of the entire model in a self-distillation manner. Finally, on four public datasets, the proposed SS-MDE method outperformed several state-of-the-art monocular depth estimation techniques, achieving an accuracy (δ1) of 89% with an AbsRel error of 0.102 on NYU-Depth V2 and an accuracy (δ1) of 87% with an AbsRel error of 0.111 on KITTI.
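To make the training objective described in the abstract concrete, the following is a minimal sketch, not the paper's exact formulation: it assumes a per-step loss that combines (1) a photometric reprojection term between adjacent frames (the self-supervised signal from the depth and pose networks), (2) a pseudo-depth term from a frozen LeReS-style prior weighted by a decaying forgetting factor, and (3) a teacher-student distillation term restricted by a multi-view mask. All function names, weights, and the exponential decay schedule are illustrative assumptions.

```python
# Hypothetical sketch of an SS-MDE-style training loss (PyTorch).
# The exact loss forms, weights, and schedule are assumptions, not the paper's.
import torch


def photometric_loss(target, warped):
    """Photometric reprojection error between the target frame and the
    adjacent frame warped with the predicted depth and pose.
    SSIM is omitted for brevity, so this reduces to a plain L1 term."""
    return (target - warped).abs().mean()


def pseudo_depth_loss(pred_depth, pseudo_depth):
    """Supervision from LeReS-style pseudo-depth (assumed log-L1 form)."""
    return (torch.log(pred_depth.clamp(min=1e-6)) -
            torch.log(pseudo_depth.clamp(min=1e-6))).abs().mean()


def distillation_loss(student_depth, teacher_depth, mask):
    """L1 between student and teacher depth, restricted to pixels the
    multi-view mask keeps (mask: 1 = reliable, 0 = filtered out)."""
    diff = (student_depth - teacher_depth).abs() * mask
    return diff.sum() / mask.sum().clamp(min=1.0)


def total_loss(target, warped, student_depth, pseudo_depth,
               teacher_depth, mask, step, decay=1e-4,
               w_pseudo=0.1, w_distill=1.0):
    """Combine the three terms; the forgetting factor exp(-decay * step)
    gradually removes the dependence on pseudo-supervision (assumed schedule)."""
    forgetting = torch.exp(torch.tensor(-decay * float(step)))
    return (photometric_loss(target, warped)
            + w_pseudo * forgetting * pseudo_depth_loss(student_depth, pseudo_depth)
            + w_distill * distillation_loss(student_depth, teacher_depth, mask))


# Toy usage with random tensors of shape (B, C, H, W):
B, H, W = 2, 64, 64
loss = total_loss(
    target=torch.rand(B, 3, H, W), warped=torch.rand(B, 3, H, W),
    student_depth=torch.rand(B, 1, H, W) + 0.1,
    pseudo_depth=torch.rand(B, 1, H, W) + 0.1,
    teacher_depth=torch.rand(B, 1, H, W) + 0.1,
    mask=(torch.rand(B, 1, H, W) > 0.3).float(),
    step=1000,
)
```

Under these assumptions, the forgetting factor starts near 1 (pseudo-depth dominates the refinement of static areas early on) and decays toward 0, so the later training relies on the photometric and distillation signals, which matches the abstract's stated motivation for reducing dependency on pseudo-supervision.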