The Temporal Opportunist: Self-Supervised Multi-Frame Monocular Depth

Self-supervised monocular depth estimation networks are trained to predict scene depth using nearby frames as a supervision signal during training. However, for many applications, sequence information in the form of video frames is also available at test time. The vast majority of monocular networks do not make use of this extra signal, thus ignoring valuable information that could be used to improve the predicted depth. Those that do either use computationally expensive test-time refinement techniques or off-the-shelf recurrent networks, which only indirectly make use of the geometric information that is inherently available.

We propose ManyDepth, an adaptive approach to dense depth estimation that can make use of sequence information at test time, when it is available. Taking inspiration from multi-view stereo, we propose a deep end-to-end cost-volume-based approach that is trained using self-supervision only. We present a novel consistency loss that encourages the network to ignore the cost volume when it is deemed unreliable, e.g. in the case of moving objects, and an augmentation scheme to cope with static cameras. Our detailed experiments on both KITTI and Cityscapes show that we outperform all published self-supervised baselines, including those that use single or multiple frames at test time.
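To make the idea of a consistency loss concrete, the following is a minimal NumPy sketch and not the paper's implementation. The abstract does not say how the cost volume is judged unreliable, so everything here is an assumption made for illustration: the function names `unreliable_mask` and `consistency_loss`, the use of a separate single-frame `reference_depth`, the ratio test, and the threshold value of 1.5.

```python
import numpy as np


def unreliable_mask(cost_volume, depth_bins, reference_depth, ratio_threshold=1.5):
    """Flag pixels where the cost volume disagrees with a reference depth.

    cost_volume:     (D, H, W) matching cost, lower means a better match.
    depth_bins:      (D,) hypothesised depth for each cost-volume slice.
    reference_depth: (H, W) depth from some single-frame estimate (assumed).
    ratio_threshold: hypothetical tolerance on the depth ratio.
    """
    best_index = np.argmin(cost_volume, axis=0)   # (H, W) best slice per pixel
    cost_volume_depth = depth_bins[best_index]    # (H, W) depth implied by matching
    ratio = np.maximum(cost_volume_depth / reference_depth,
                       reference_depth / cost_volume_depth)
    return ratio > ratio_threshold


def consistency_loss(predicted_depth, reference_depth, mask):
    """Mean absolute difference to the reference, over masked pixels only."""
    if not mask.any():
        return 0.0
    return float(np.abs(predicted_depth - reference_depth)[mask].mean())
```

In this sketch the loss only acts where matching and the single-frame estimate disagree, for example on a moving object, so a network trained with it is pushed towards the single-frame estimate in exactly those regions and is left free to use the cost volume elsewhere.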
