SUB-Depth: Self-distillation and Uncertainty Boosting Self-supervised Monocular Depth Estimation

We propose SUB-Depth, a universal multi-task training framework for self-supervised monocular depth estimation (SDE). Depth models trained with SUB-Depth outperform the same models trained in a standard single-task SDE framework. By introducing an additional self-distillation task into a standard SDE training framework, SUB-Depth trains a depth network, not only to predict the depth map for an image reconstruction task, but also to distill knowledge from a trained teacher network with unlabelled data. To take advantage of this multi-task setting, we propose homoscedastic uncertainty formulations for each task to penalize areas likely to be affected by teacher network noise, or violate SDE assumptions. We present extensive evaluations on KITTI to demonstrate the improvements achieved by training a range of existing networks using the proposed framework, and we achieve state-of-the-art performance on this task. Additionally, SUB-Depth enables models to estimate uncertainty on depth output.

[1]  Konrad Schindler,et al.  Towards Robust Monocular Depth Estimation: Mixing Datasets for Zero-Shot Cross-Dataset Transfer , 2019, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[2]  Zeeshan Khan Suri,et al.  CamLessMonoDepth: Monocular Depth Estimation with Unknown Camera Parameters , 2021, BMVC.

[3]  Yizhe Zhang,et al.  X-Distill: Improving Self-Supervised Monocular Depth via Cross-Task Distillation , 2021, BMVC.

[4]  Hang Zhou,et al.  Self-Supervised Monocular Depth Estimation with Internal Feature Fusion , 2021, BMVC.

[5]  Sylvain Paris,et al.  Boosting Monocular Depth Estimation Models to High-Resolution via Content-Adaptive Multi-Resolution Merging , 2021, 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[6]  Junmo Kim,et al.  Patch-Wise Attention Network for Monocular Depth Estimation , 2021, AAAI.

[7]  Oisin Mac Aodha,et al.  The Temporal Opportunist: Self-Supervised Multi-Frame Monocular Depth , 2021, 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[8]  Wenjun Zeng,et al.  S2R-DepthNet: Learning a Generalizable Depth-specific Structural Representation , 2021, 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[9]  Jie Zhou,et al.  SOSD-Net: Joint Semantic Object Segmentation and Depth Estimation from Monocular images , 2021, Neurocomputing.

[10]  Liang Liu,et al.  HR-Depth: High Resolution Self-Supervised Monocular Depth Estimation , 2020, AAAI.

[11]  Md. Amirul Islam,et al.  Bidirectional Attention Network for Monocular Depth Estimation , 2020, 2021 IEEE International Conference on Robotics and Automation (ICRA).

[12]  Stefan Milz,et al.  SynDistNet: Self-Supervised Monocular Fisheye Camera Distance Estimation Synergized with Semantic Segmentation for Autonomous Driving , 2020, 2021 IEEE Winter Conference on Applications of Computer Vision (WACV).

[13]  S. Izadi,et al.  HITNet: Hierarchical Iterative Tile Refinement Network for Real-time Stereo Matching , 2020, 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[14]  Hang Zhou,et al.  Constant Velocity Constraints for Self-Supervised Monocular Depth Estimation , 2020, CVMP.

[15]  Changick Kim,et al.  SAFENet: Self-Supervised Monocular Depth Estimation with Semantic-Aware Feature Extraction , 2020 .

[16]  Tim Fingscheidt,et al.  Self-Supervised Monocular Depth Estimation: Solving the Dynamic Object Problem by Semantic Guidance , 2020, ECCV.

[17]  Patrick Mäder,et al.  UnRectDepthNet: Self-Supervised Monocular Depth Estimation using a Generic Framework for Handling Common Camera Distortion Models , 2020, 2020 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS).

[18]  Stefano Mattoccia,et al.  On the Uncertainty of Self-Supervised Monocular Depth Estimation , 2020, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[19]  Michel Sarkis,et al.  Multi-Task Learning for Single Image Depth Estimation and Segmentation Based on Unsupervised Network , 2020, 2020 IEEE International Conference on Robotics and Automation (ICRA).

[20]  Gustavo Carneiro,et al.  Self-Supervised Monocular Trained Depth Estimation Using Self-Attention and Discrete Disparity Volume , 2020, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[21]  Nan Yang,et al.  D3VO: Deep Depth, Deep Pose and Deep Uncertainty for Monocular Visual Odometry , 2020, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[22]  Rares Ambrus,et al.  Semantically-Guided Representation Learning for Self-Supervised Monocular Depth , 2020, ICLR.

[23]  S. Levine,et al.  Gradient Surgery for Multi-Task Learning , 2020, NeurIPS.

[24]  Rares Ambrus,et al.  3D Packing for Self-Supervised Monocular Depth Estimation , 2019, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[25]  Natalia Gimelshein,et al.  PyTorch: An Imperative Style, High-Performance Deep Learning Library , 2019, NeurIPS.

[26]  Hyunmin Lee,et al.  Real-Time Stereo Matching Network with High Accuracy , 2019, 2019 IEEE International Conference on Image Processing (ICIP).

[27]  Marco Körner,et al.  MultiDepth: Single-Image Depth Estimation via Multi-Task Regression and Classification , 2019, 2019 IEEE Intelligent Transportation Systems Conference (ITSC).

[28]  Alexander H. Liu,et al.  Towards Scene Understanding: Unsupervised Monocular Depth Estimation With Semantic-Aware Representation , 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[29]  Jonathan T. Barron,et al.  Learning Single Camera Depth Estimation Using Dual-Pixels , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[30]  Anelia Angelova,et al.  Depth From Videos in the Wild: Unsupervised Monocular Depth Learning From Unknown Cameras , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[31]  Gabriel J. Brostow,et al.  Digging Into Self-Supervised Monocular Depth Estimation , 2018, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[32]  Vladlen Koltun,et al.  Multi-Task Learning as Multi-Objective Optimization , 2018, NeurIPS.

[33]  Andrea Vedaldi,et al.  Supervising the New with the Old: Learning SFM from SFM , 2018, ECCV.

[34]  Yinda Zhang,et al.  ActiveStereoNet: End-to-End Self-Supervised Learning for Active Stereo Systems , 2018, ECCV.

[35]  Nicu Sebe,et al.  PAD-Net: Multi-tasks Guided Prediction-and-Distillation Network for Simultaneous Depth Estimation and Scene Parsing , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[36]  Thomas Brox,et al.  Uncertainty Estimates and Multi-hypotheses Networks for Optical Flow , 2018, ECCV.

[37]  Roberto Cipolla,et al.  Multi-task Learning Using Uncertainty to Weigh Losses for Scene Geometry and Semantics , 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[38]  Thomas Brox,et al.  Sparsity Invariant CNNs , 2017, 2017 International Conference on 3D Vision (3DV).

[39]  Noah Snavely,et al.  Unsupervised Learning of Depth and Ego-Motion from Video , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[40]  Kilian Q. Weinberger,et al.  Snapshot Ensembles: Train 1, get M for free , 2017, ICLR.

[41]  Alex Kendall,et al.  What Uncertainties Do We Need in Bayesian Deep Learning for Computer Vision? , 2017, NIPS.

[42]  Alex Kendall,et al.  End-to-End Learning of Geometry and Context for Deep Stereo Regression , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[43]  Charles Blundell,et al.  Simple and Scalable Predictive Uncertainty Estimation using Deep Ensembles , 2016, NIPS.

[44]  Oisin Mac Aodha,et al.  Unsupervised Monocular Depth Estimation with Left-Right Consistency , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[45]  Qiao Wang,et al.  VirtualWorlds as Proxy for Multi-object Tracking Analysis , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[46]  Sebastian Ramos,et al.  The Cityscapes Dataset for Semantic Urban Scene Understanding , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[47]  Takanori Senoh,et al.  Fast depth estimation using non-iterative local optimization for super multi-view images , 2015, 2015 IEEE Global Conference on Signal and Information Processing (GlobalSIP).

[48]  Geoffrey E. Hinton,et al.  Distilling the Knowledge in a Neural Network , 2015, ArXiv.

[49]  Jimmy Ba,et al.  Adam: A Method for Stochastic Optimization , 2014, ICLR.

[50]  Rob Fergus,et al.  Depth Map Prediction from a Single Image using a Multi-Scale Deep Network , 2014, NIPS.

[51]  Nitish Srivastava,et al.  Dropout: a simple way to prevent neural networks from overfitting , 2014, J. Mach. Learn. Res..

[52]  Andreas Geiger,et al.  Vision meets robotics: The KITTI dataset , 2013, Int. J. Robotics Res..

[53]  Fei-Fei Li,et al.  ImageNet: A large-scale hierarchical image database , 2009, 2009 IEEE Conference on Computer Vision and Pattern Recognition.

[54]  Eero P. Simoncelli,et al.  Image quality assessment: from error visibility to structural similarity , 2004, IEEE Transactions on Image Processing.