Unsupervised Learning of Monocular Depth Estimation with Bundle Adjustment, Super-Resolution and Clip Loss

We present a novel unsupervised learning framework for single view depth estimation using monocular videos. It is well known in 3D vision that enlarging the baseline can increase the depth estimation accuracy, and jointly optimizing a set of camera poses and landmarks is essential. In previous monocular unsupervised learning frameworks, only part of the photometric and geometric constraints within a sequence are used as supervisory signals. This may result in a short baseline and overfitting. Besides, previous works generally estimate a low resolution depth from a low resolution impute image. The low resolution depth is then interpolated to recover the original resolution. This strategy may generate large errors on object boundaries, as the depth of background and foreground are mixed to yield the high resolution depth. In this paper, we introduce a bundle adjustment framework and a super-resolution network to solve the above two problems. In bundle adjustment, depths and poses of an image sequence are jointly optimized, which increases the baseline by establishing the relationship between farther frames. The super resolution network learns to estimate a high resolution depth from a low resolution image. Additionally, we introduce the clip loss to deal with moving objects and occlusion. Experimental results on the KITTI dataset show that the proposed algorithm outperforms the state-of-the-art unsupervised methods using monocular sequences, and achieves comparable or even better result compared to unsupervised methods using stereo sequences.

[1]  Zhengqi Li,et al.  MegaDepth: Learning Single-View Depth Prediction from Internet Photos , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[2]  William T. Freeman,et al.  Learning Ordinal Relationships for Mid-Level Vision , 2015, 2015 IEEE International Conference on Computer Vision (ICCV).

[3]  Ce Liu,et al.  Depth Transfer: Depth Extraction from Video Using Non-Parametric Sampling , 2014, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[4]  Chunhua Shen,et al.  Depth and surface normal estimation from monocular images using regression on deep features and hierarchical CRFs , 2015, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[5]  Toby P. Breckon,et al.  Real-Time Monocular Depth Estimation Using Synthetic Data with Domain Adaptation via Image Style Transfer , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[6]  Nassir Navab,et al.  Deeper Depth Prediction with Fully Convolutional Residual Networks , 2016, 2016 Fourth International Conference on 3D Vision (3DV).

[7]  Rob Fergus,et al.  Depth Map Prediction from a Single Image using a Multi-Scale Deep Network , 2014, NIPS.

[8]  Ruigang Yang,et al.  Depth Estimation via Affinity Learned with Convolutional Spatial Propagation Network , 2018, ECCV.

[9]  Alan L. Yuille,et al.  Towards unified depth and semantic prediction from a single image , 2015, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[10]  Gustavo Carneiro,et al.  Unsupervised CNN for Single View Depth Estimation: Geometry to the Rescue , 2016, ECCV.

[11]  Xuming He,et al.  Discrete-Continuous Depth Estimation from a Single Image , 2014, 2014 IEEE Conference on Computer Vision and Pattern Recognition.

[12]  Guosheng Lin,et al.  Deep convolutional neural fields for depth estimation from a single image , 2014, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[13]  Thomas Brox,et al.  DeMoN: Depth and Motion Network for Learning Monocular Stereo , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[14]  Jan-Michael Frahm,et al.  Pixelwise View Selection for Unstructured Multi-View Stereo , 2016, ECCV.

[15]  Daniel Cremers,et al.  LSD-SLAM: Large-Scale Direct Monocular SLAM , 2014, ECCV.

[16]  Soumith Chintala,et al.  Unsupervised Representation Learning with Deep Convolutional Generative Adversarial Networks , 2015, ICLR.

[17]  Stefano Mattoccia,et al.  Towards Real-Time Unsupervised Monocular Depth Estimation on CPU , 2018, 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS).

[18]  Wei Xu,et al.  Unsupervised Learning of Geometry with Edge-aware Depth-Normal Consistency , 2017, AAAI.

[19]  Jörg Stückler,et al.  Semi-Supervised Deep Learning for Monocular Depth Map Prediction , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[20]  Simon Lucey,et al.  Learning Depth from Monocular Videos Using Direct Methods , 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[21]  Chao Dong,et al.  Recovering Realistic Texture in Image Super-Resolution by Deep Spatial Feature Transform , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[22]  Zhichao Yin,et al.  GeoNet: Unsupervised Learning of Dense Depth, Optical Flow and Camera Pose , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[23]  Ashutosh Saxena,et al.  Learning Depth from Single Monocular Images , 2005, NIPS.

[24]  Juan D. Tardós,et al.  ORB-SLAM2: An Open-Source SLAM System for Monocular, Stereo, and RGB-D Cameras , 2016, IEEE Transactions on Robotics.

[25]  Cordelia Schmid,et al.  SfM-Net: Learning of Structure and Motion from Video , 2017, ArXiv.

[26]  Stefano Mattoccia,et al.  Learning Monocular Depth Estimation with Unsupervised Trinocular Assumptions , 2018, 2018 International Conference on 3D Vision (3DV).

[27]  Stefano Mattoccia,et al.  Generative Adversarial Networks for Unsupervised Monocular Depth Prediction , 2018, ECCV Workshops.

[28]  P. J. Narayanan,et al.  Structured Adversarial Training for Unsupervised Monocular Depth Estimation , 2018, 2018 International Conference on 3D Vision (3DV).

[29]  Chunhua Shen,et al.  Estimating Depth From Monocular Images as Classification Using Deep Fully Convolutional Residual Networks , 2016, IEEE Transactions on Circuits and Systems for Video Technology.

[30]  Eero P. Simoncelli,et al.  Image quality assessment: from error visibility to structural similarity , 2004, IEEE Transactions on Image Processing.

[31]  Jian Sun,et al.  Deep Residual Learning for Image Recognition , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[32]  Nicu Sebe,et al.  Unsupervised Adversarial Depth Estimation Using Cycled Generative Networks , 2018, 2018 International Conference on 3D Vision (3DV).

[33]  John Flynn,et al.  Deep Stereo: Learning to Predict New Views from the World's Imagery , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[34]  Sebastian Ramos,et al.  The Cityscapes Dataset for Semantic Urban Scene Understanding , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[35]  Suchendra M. Bhandarkar,et al.  DepthNet: A Recurrent Neural Network Architecture for Monocular Depth Prediction , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW).

[36]  Narendra Ahuja,et al.  Fast and Accurate Image Super-Resolution with Deep Laplacian Pyramid Networks , 2017, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[37]  Suchendra M. Bhandarkar,et al.  Monocular Depth Prediction Using Generative Adversarial Networks , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW).

[38]  Ali Farhadi,et al.  Deep3D: Fully Automatic 2D-to-3D Video Conversion with Deep Convolutional Neural Networks , 2016, ECCV.

[39]  Weifeng Chen,et al.  Single-Image Depth Perception in the Wild , 2016, NIPS.

[40]  Andrew Zisserman,et al.  Very Deep Convolutional Networks for Large-Scale Image Recognition , 2014, ICLR.

[41]  Wei Xu,et al.  Every Pixel Counts: Unsupervised Geometry Learning with Holistic 3D Motion Understanding , 2018, ECCV Workshops.

[42]  Ian D. Reid,et al.  Unsupervised Learning of Monocular Depth Estimation and Visual Odometry with Deep Feature Reconstruction , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[43]  Daniel Cremers,et al.  Direct Sparse Odometry , 2016, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[44]  Jan-Michael Frahm,et al.  Structure-from-Motion Revisited , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[45]  Andreas Geiger,et al.  Are we ready for autonomous driving? The KITTI vision benchmark suite , 2012, 2012 IEEE Conference on Computer Vision and Pattern Recognition.

[46]  Ian D. Reid,et al.  Learning Depth from Single Monocular Images Using Deep Convolutional Neural Fields , 2015, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[47]  Xiaogang Wang,et al.  Learning Monocular Depth by Distilling Cross-domain Stereo Networks , 2018, ECCV.

[48]  Andreas Geiger,et al.  Object scene flow for autonomous vehicles , 2015, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[49]  Andrew Zisserman,et al.  Spatial Transformer Networks , 2015, NIPS.

[50]  Jimmy Ba,et al.  Adam: A Method for Stochastic Optimization , 2014, ICLR.

[51]  Jia-Bin Huang,et al.  DF-Net: Unsupervised Joint Learning of Depth and Flow using Cross-Task Consistency , 2018, ECCV.

[52]  Anelia Angelova,et al.  Unsupervised Learning of Depth and Ego-Motion from Monocular Video Using 3D Geometric Constraints , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[53]  Gabriel J. Brostow,et al.  Digging Into Self-Supervised Monocular Depth Estimation , 2018, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[54]  Rob Fergus,et al.  Predicting Depth, Surface Normals and Semantic Labels with a Common Multi-scale Convolutional Architecture , 2014, 2015 IEEE International Conference on Computer Vision (ICCV).

[55]  Marc Pollefeys,et al.  Pulling Things out of Perspective , 2014, 2014 IEEE Conference on Computer Vision and Pattern Recognition.

[56]  Noah Snavely,et al.  Unsupervised Learning of Depth and Ego-Motion from Video , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[57]  Oisin Mac Aodha,et al.  Unsupervised Monocular Depth Estimation with Left-Right Consistency , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[58]  Stephen Gould,et al.  Single image depth estimation from predicted semantic labels , 2010, 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition.

[59]  Wei Xu,et al.  LEGO: Learning Edge with Geometry all at Once by Watching Videos , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.