Windowed Bundle Adjustment Framework for Unsupervised Learning of Monocular Depth Estimation With U-Net Extension and Clip Loss

This letter presents a self-supervised framework for learning depth from monocular videos. The main contributions are: (1) We present a windowed bundle adjustment framework to train the network. Whereas most previous works only consider constraints between consecutive frames, our framework increases the camera baseline and introduces more constraints to avoid overfitting. (2) We extend the widely used U-Net architecture with a Spatial Pyramid Net (SPN) and a Super Resolution Net (SRN). The SPN fuses information from an image spatial pyramid for depth estimation, which addresses the attenuation of context information in the original U-Net. The SRN learns to estimate a high-resolution depth map from a low-resolution image, which benefits the recovery of details. (3) We adopt a clip loss function to handle moving objects and occlusions, which previous works addressed by designing complicated networks or by requiring extra information (such as a segmentation mask [1]). Experimental results show that our algorithm provides state-of-the-art results on the KITTI benchmark.
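Contributions (1) and (3) can be illustrated with a minimal sketch. It is not the authors' implementation, and the abstract does not specify either formulation. The sketch assumes that a window contributes constraints from every pair of frames rather than only adjacent ones, and that the clip loss caps per-pixel errors at a percentile threshold so that outliers (for example, from moving objects or occlusions) stop dominating the average. The names `window_pairs`, `clip_loss`, and `percentile` are hypothetical.

```python
from itertools import combinations


def window_pairs(num_frames):
    """Return every unordered frame-index pair inside one window.

    Assumption: constraints are formed between all pairs in the window,
    not only between consecutive frames.
    """
    return list(combinations(range(num_frames), 2))


def clip_loss(errors, percentile=0.95):
    """Average per-pixel errors after capping them at a percentile threshold.

    Assumption: `percentile` is an illustrative hyperparameter; errors above
    the threshold are clipped to it, so they contribute a constant value.
    """
    ranked = sorted(errors)
    # Index of the threshold element, kept inside the valid range.
    k = min(len(ranked) - 1, int(percentile * len(ranked)))
    threshold = ranked[k]
    clipped = [min(e, threshold) for e in errors]
    return sum(clipped) / len(clipped)


if __name__ == "__main__":
    # A 3-frame window yields 3 pairs, versus 2 for consecutive frames only.
    print(window_pairs(3))
    # One large outlier is capped before averaging.
    print(clip_loss([1.0, 1.0, 1.0, 100.0], percentile=0.5))
```

In a differentiable training setup, clipping in this way would give the capped pixels zero gradient, which is one plausible reading of how such a loss suppresses outliers.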

[1] Ian D. Reid, et al. Unsupervised Learning of Monocular Depth Estimation and Visual Odometry with Deep Feature Reconstruction, 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[2] Xiaogang Wang, et al. Pyramid Scene Parsing Network, 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[3] Daniel Cremers, et al. Direct Sparse Odometry, 2016, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[4] Bernhard P. Wrobel, et al. Multiple View Geometry in Computer Vision, 2001.

[5] Eero P. Simoncelli, et al. Image quality assessment: from error visibility to structural similarity, 2004, IEEE Transactions on Image Processing.

[6] Kilian Q. Weinberger, et al. Densely Connected Convolutional Networks, 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[7] Jia-Bin Huang, et al. DF-Net: Unsupervised Joint Learning of Depth and Flow using Cross-Task Consistency, 2018, ECCV.

[8] Ian D. Reid, et al. Self-supervised Learning for Single View Depth and Surface Normal Estimation, 2019, 2019 International Conference on Robotics and Automation (ICRA).

[9] Rares Ambrus, et al. SuperDepth: Self-Supervised, Super-Resolved Monocular Depth Estimation, 2018, 2019 International Conference on Robotics and Automation (ICRA).

[10] Ping Tan, et al. BA-Net: Dense Bundle Adjustment Network, 2018, ICLR 2018.

[11] Anelia Angelova, et al. Unsupervised Learning of Depth and Ego-Motion from Monocular Video Using 3D Geometric Constraints, 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[12] Gabriel J. Brostow, et al. Digging Into Self-Supervised Monocular Depth Estimation, 2018, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[13] Andreas Geiger, et al. Are we ready for autonomous driving? The KITTI vision benchmark suite, 2012, 2012 IEEE Conference on Computer Vision and Pattern Recognition.

[14] Ian D. Reid, et al. Real-Time Joint Semantic Segmentation and Depth Estimation Using Asymmetric Annotations, 2018, 2019 International Conference on Robotics and Automation (ICRA).

[15] Nassir Navab, et al. Deeper Depth Prediction with Fully Convolutional Residual Networks, 2016, 2016 Fourth International Conference on 3D Vision (3DV).

[16] Suchendra M. Bhandarkar, et al. Monocular Depth Prediction Using Generative Adversarial Networks, 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW).

[17] Chunhua Shen, et al. Depth and surface normal estimation from monocular images using regression on deep features and hierarchical CRFs, 2015, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[18] Rob Fergus, et al. Depth Map Prediction from a Single Image using a Multi-Scale Deep Network, 2014, NIPS.

[19] Oisin Mac Aodha, et al. Unsupervised Monocular Depth Estimation with Left-Right Consistency, 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[20] Thomas Brox, et al. U-Net: Convolutional Networks for Biomedical Image Segmentation, 2015, MICCAI.

[21] Wei Xu, et al. LEGO: Learning Edge with Geometry all at Once by Watching Videos, 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[22] Daniel Rueckert, et al. Real-Time Single Image and Video Super-Resolution Using an Efficient Sub-Pixel Convolutional Neural Network, 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[23] Stefano Mattoccia, et al. Generative Adversarial Networks for Unsupervised Monocular Depth Prediction, 2018, ECCV Workshops.

[24] Zhengqi Li, et al. MegaDepth: Learning Single-View Depth Prediction from Internet Photos, 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[25] Stefano Mattoccia, et al. Towards Real-Time Unsupervised Monocular Depth Estimation on CPU, 2018, 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS).

[26] Ian D. Reid, et al. Learning Depth from Single Monocular Images Using Deep Convolutional Neural Fields, 2015, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[27] Xiaogang Wang, et al. Learning Monocular Depth by Distilling Cross-domain Stereo Networks, 2018, ECCV.

[28] Andreas Geiger, et al. Object scene flow for autonomous vehicles, 2015, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[29] Zhichao Yin, et al. GeoNet: Unsupervised Learning of Dense Depth, Optical Flow and Camera Pose, 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[30] Andrew Zisserman, et al. Spatial Transformer Networks, 2015, NIPS.

[31] Jimmy Ba, et al. Adam: A Method for Stochastic Optimization, 2014, ICLR.

[32] Simon Lucey, et al. Learning Depth from Monocular Videos Using Direct Methods, 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[33] Soumith Chintala, et al. Unsupervised Representation Learning with Deep Convolutional Generative Adversarial Networks, 2015, ICLR.

[34] Guosheng Lin, et al. Deep convolutional neural fields for depth estimation from a single image, 2014, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[35] Wei Xu, et al. Every Pixel Counts: Unsupervised Geometry Learning with Holistic 3D Motion Understanding, 2018, ECCV Workshops.

[36] Noah Snavely, et al. Unsupervised Learning of Depth and Ego-Motion from Video, 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[37] Vignesh Prasad, et al. SfMLearner++: Learning Monocular Depth & Ego-Motion Using Meaningful Geometric Constraints, 2018, 2019 IEEE Winter Conference on Applications of Computer Vision (WACV).

[38] Anelia Angelova, et al. Depth Prediction Without the Sensors: Leveraging Structure for Unsupervised Learning from Monocular Videos, 2018, AAAI.

[39] Gustavo Carneiro, et al. Unsupervised CNN for Single View Depth Estimation: Geometry to the Rescue, 2016, ECCV.

[40] Ce Liu, et al. Depth Transfer: Depth Extraction from Video Using Non-Parametric Sampling, 2014, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[41] Wei Xu, et al. Unsupervised Learning of Geometry with Edge-aware Depth-Normal Consistency, 2017, AAAI.