Revealing the Reciprocal Relations between Self-Supervised Stereo and Monocular Depth Estimation

Current self-supervised depth estimation algorithms mainly focus on either stereo or monocular only, neglecting the reciprocal relations between them. In this paper, we propose a simple yet effective framework to improve both stereo and monocular depth estimation by leveraging the underlying complementary knowledge of the two tasks. Our approach consists of three stages. In the first stage, the proposed stereo matching network termed StereoNet is trained on image pairs in a self-supervised manner. Second, we introduce an occlusion-aware distillation (OA Distillation) module, which leverages the predicted depths from StereoNet in non-occluded regions to train our monocular depth estimation network named SingleNet. At last, we design an occlusion-aware fusion module (OA Fusion), which generates more reliable depths by fusing estimated depths from StereoNet and SingleNet given the occlusion map. Furthermore, we also take the fused depths as pseudo labels to supervise StereoNet in turn, which brings StereoNet’s performance to a new height. Extensive experiments on KITTI dataset demonstrate the effectiveness of our proposed framework. We achieve new SOTA performance on both stereo and monocular depth estimation tasks.

[1]  Liang Liu,et al.  HR-Depth: High Resolution Self-Supervised Monocular Depth Estimation , 2020, AAAI.

[2]  Peter Wonka,et al.  AdaBins: Depth Estimation Using Adaptive Bins , 2020, 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[3]  Haibin Ling,et al.  3D Mapping and 6D Pose Computation for Real Time Augmented Reality on Cylindrical Objects , 2020, IEEE Transactions on Circuits and Systems for Video Technology.

[4]  Juan Luis Gonzalez,et al.  Forget About the LiDAR: Self-Supervised Depth Estimators with MED Probability Volumes , 2020, NeurIPS.

[5]  Chang Shu,et al.  Feature-metric Loss for Self-supervised Learning of Depth and Egomotion , 2020, ECCV.

[6]  Ying Tai,et al.  Learning by Analogy: Reliable Supervision From Transformations for Unsupervised Optical Flow Estimation , 2020, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[7]  Natalia Gimelshein,et al.  PyTorch: An Imperative Style, High-Performance Deep Learning Library , 2019, NeurIPS.

[8]  Xin Jiang,et al.  TinyBERT: Distilling BERT for Natural Language Understanding , 2019, FINDINGS.

[9]  Gabriel J. Brostow,et al.  Self-Supervised Monocular Depth Hints , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[10]  Chunhua Shen,et al.  Unsupervised Scale-consistent Depth and Ego-motion Learning from Monocular Video , 2019, NeurIPS.

[11]  Il Hong Suh,et al.  From Big to Small: Multi-Scale Local Planar Guidance for Monocular Depth Estimation , 2019, ArXiv.

[12]  Xia Li,et al.  6D-VNet: End-To-End 6DoF Vehicle Pose Estimation From Monocular RGB Images , 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW).

[13]  Adrien Gaidon,et al.  3D Packing for Self-Supervised Monocular Depth Estimation , 2019, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[14]  Ruigang Yang,et al.  GA-Net: Guided Aggregation Net for End-To-End Stereo Matching , 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[15]  Stefano Mattoccia,et al.  Learning Monocular Depth Estimation Infusing Traditional Stereo Knowledge , 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[16]  Nicu Sebe,et al.  Refine and Distill: Exploiting Cycle-Inconsistency and Knowledge Distillation for Unsupervised Monocular Depth Estimation , 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[17]  Xiaogang Wang,et al.  Group-Wise Correlation Stereo Network , 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[18]  Michael R. Lyu,et al.  DDFlow: Learning Optical Flow with Unlabeled Data Distillation , 2019, AAAI.

[19]  Adrien Gaidon,et al.  ROI-10D: Monocular Lifting of 2D Detection to 6D Pose and Metric Shape , 2018, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[20]  Wei Xu,et al.  Every Pixel Counts ++: Joint Learning of Geometry and Motion with 3D Holistic Understanding , 2018, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[21]  Rares Ambrus,et al.  SuperDepth: Self-Supervised, Super-Resolved Monocular Depth Estimation , 2018, 2019 International Conference on Robotics and Automation (ICRA).

[22]  Dieter Fox,et al.  Deep Object Pose Estimation for Semantic Robotic Grasping of Household Objects , 2018, CoRL.

[23]  Hongdong Li,et al.  Open-World Stereo Video Matching with Deep RNN , 2018, ECCV.

[24]  Wei Xu,et al.  Every Pixel Counts: Unsupervised Geometry Learning with Holistic 3D Motion Understanding , 2018, ECCV Workshops.

[25]  Gabriel J. Brostow,et al.  Digging Into Self-Supervised Monocular Depth Estimation , 2018, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[26]  Dacheng Tao,et al.  Deep Ordinal Regression Network for Monocular Depth Estimation , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[27]  Yong-Sheng Chen,et al.  Pyramid Stereo Matching Network , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[28]  Ian D. Reid,et al.  Unsupervised Learning of Monocular Depth Estimation and Visual Odometry with Deep Feature Reconstruction , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[29]  T. Han,et al.  Learning Efficient Object Detection Models with Knowledge Distillation , 2017, NIPS.

[30]  Simon Lucey,et al.  Learning Depth from Monocular Videos Using Direct Methods , 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[31]  Hong Zhang,et al.  Unsupervised Learning of Stereo Matching , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[32]  Chong Wang,et al.  Model Distillation with Knowledge Transfer from Face Classification to Alignment and Verification , 2017, 1709.02929.

[33]  Deqing Sun,et al.  PWC-Net: CNNs for Optical Flow Using Pyramid, Warping, and Cost Volume , 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[34]  Hongdong Li,et al.  Self-Supervised Learning for Stereo Matching with Self-Improving Ability , 2017, ArXiv.

[35]  Nicu Sebe,et al.  Multi-scale Continuous CRFs as Sequential Deep Networks for Monocular Depth Estimation , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[36]  Alex Kendall,et al.  End-to-End Learning of Geometry and Context for Deep Stereo Regression , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[37]  Yale Song,et al.  Learning from Noisy Labels with Distillation , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[38]  Jan Kautz,et al.  Loss Functions for Image Restoration With Neural Networks , 2017, IEEE Transactions on Computational Imaging.

[39]  Éric Marchand,et al.  Pose Estimation for Augmented Reality: A Hands-On Survey , 2016, IEEE Transactions on Visualization and Computer Graphics.

[40]  Oisin Mac Aodha,et al.  Unsupervised Monocular Depth Estimation with Left-Right Consistency , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[41]  Gustavo Carneiro,et al.  Unsupervised CNN for Single View Depth Estimation: Geometry to the Rescue , 2016, ECCV.

[42]  Thomas Brox,et al.  A Large Dataset to Train Convolutional Networks for Disparity, Optical Flow, and Scene Flow Estimation , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[43]  Max Jaderberg,et al.  Spatial Transformer Networks , 2015, NIPS.

[44]  Jimmy Ba,et al.  Adam: A Method for Stochastic Optimization , 2014, ICLR.

[45]  R. Fergus,et al.  Predicting Depth, Surface Normals and Semantic Labels with a Common Multi-scale Convolutional Architecture , 2014, 2015 IEEE International Conference on Computer Vision (ICCV).

[46]  Andreas Geiger,et al.  Are we ready for autonomous driving? The KITTI vision benchmark suite , 2012, 2012 IEEE Conference on Computer Vision and Pattern Recognition.

[47]  Kurt Keutzer,et al.  Dense Point Trajectories by GPU-Accelerated Large Displacement Optical Flow , 2010, ECCV.

[48]  Eero P. Simoncelli,et al.  Image quality assessment: from error visibility to structural similarity , 2004, IEEE Transactions on Image Processing.

[49]  Yi Yang,et al.  Supplementary Materials for UnOS: Unified Unsupervised Optical-flow and Stereo-depth Estimation by Watching Videos , 2019 .

[50]  H. Hirschmüller Accurate and Efficient Stereo Processing by Semi-Global Matching and Mutual Information , 2005, CVPR.