A Weight-shared Dual-branch Convolutional Neural Network for Unsupervised Dense Depth Prediction and Camera Motion Estimation

Convolutional Neural Networks (CNNs) can predict dense depth and camera motion from images as independent tasks; however, ignoring the relationship between the depth map and camera motion increases the burden of labeling datasets and limits the accuracy of the results. In this paper, an end-to-end unsupervised dual-branch CNN is proposed to predict a pixel-wise depth map and simultaneously estimate the camera pose. In particular, a weight-sharing strategy between the two branches is designed to strengthen the connection between the depth map and camera motion. In addition, to reduce the impact of photometric noise, intermediate feature maps are used to compute feature errors. Experimental results on the KITTI dataset demonstrate that our method achieves better performance on dense depth prediction and camera pose estimation compared with state-of-the-art approaches.
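The two ideas in the abstract, a shared feature extractor feeding both the depth and pose branches, and a loss computed on intermediate feature maps rather than raw pixels, can be illustrated with a minimal numpy sketch. This is a hypothetical simplification, not the authors' architecture: `shared_encoder` stands in for the shared convolutional trunk, and `feature_error` stands in for the feature reconstruction loss applied to a warped source frame.

```python
import numpy as np

def shared_encoder(image, kernels):
    """Toy shared encoder: one valid-mode convolution per kernel.
    Both the depth branch and the pose branch would consume these
    feature maps, mimicking the weight-sharing idea (simplified)."""
    H, W = image.shape
    k = kernels.shape[-1]
    out = np.zeros((kernels.shape[0], H - k + 1, W - k + 1))
    for c, ker in enumerate(kernels):
        for i in range(H - k + 1):
            for j in range(W - k + 1):
                out[c, i, j] = np.sum(image[i:i + k, j:j + k] * ker)
    return out

def feature_error(feat_warped, feat_target):
    """Mean absolute error between intermediate feature maps,
    used instead of a raw photometric error to reduce sensitivity
    to photometric noise."""
    return np.mean(np.abs(feat_warped - feat_target))

rng = np.random.default_rng(0)
kernels = rng.standard_normal((4, 3, 3))
frame_t = rng.standard_normal((8, 8))
# Stand-in for a source frame warped into the target view using the
# predicted depth and pose (the actual warping is omitted here).
frame_s = frame_t + 0.1 * rng.standard_normal((8, 8))

feat_t = shared_encoder(frame_t, kernels)  # features feed both branches
feat_s = shared_encoder(frame_s, kernels)
loss = feature_error(feat_s, feat_t)
print(feat_t.shape, loss > 0.0)
```

In the full method, the warped frame would come from view synthesis using the predicted depth map and camera pose, so minimizing the feature error trains both branches jointly.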
