Unsupervised Learning of Monocular Depth and Large-Ego-Motion With Multiple Loop Consistency Losses

We propose UnLearnerVO, a framework for the joint unsupervised learning of monocular depth and camera motion from video. UnLearnerVO exploits the relationships imposed by 3D scene geometry and estimates the 6-DoF pose of a monocular camera end to end. The framework has two distinguishing features: an unsupervised depth learning pipeline that uses both consecutive and non-consecutive frames, and robustness in scenarios with large camera motion. Specifically, we exploit a pose loop consistency loss that optimizes the camera pose by enforcing consistency among the poses estimated across consecutive and non-consecutive frames. We further propose a photometric loop consistency loss, which reduces the disturbance caused by factors such as dynamic moving objects and photometric inconsistency. Experiments on the KITTI datasets show that UnLearnerVO achieves state-of-the-art results in large camera motion scenarios and outperforms currently popular unsupervised approaches.
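
The idea behind a pose loop consistency constraint can be illustrated with a small sketch. The abstract does not give the loss formulation, so everything below is an assumption made for illustration: poses are 4x4 homogeneous transforms, `T_ab` maps points from frame a to frame b, and the residual is the Frobenius norm between the composed consecutive poses and the directly estimated non-consecutive pose. The helper names (`make_pose`, `matmul`, `pose_loop_residual`) are hypothetical and are not taken from the paper.

```python
import math


def matmul(a, b):
    """Multiply two 4x4 matrices stored as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]


def make_pose(yaw, tx, ty, tz):
    """Build a 4x4 rigid transform: rotation about z by `yaw`, then translation."""
    c, s = math.cos(yaw), math.sin(yaw)
    return [[c, -s, 0.0, tx],
            [s, c, 0.0, ty],
            [0.0, 0.0, 1.0, tz],
            [0.0, 0.0, 0.0, 1.0]]


def pose_loop_residual(T_12, T_23, T_13):
    """Frobenius norm of (T_23 * T_12 - T_13).

    Assumed convention: T_ab maps frame-a coordinates to frame-b coordinates,
    so chaining 1->2 and 2->3 should reproduce the direct 1->3 estimate.
    """
    composed = matmul(T_23, T_12)
    return math.sqrt(sum((composed[i][j] - T_13[i][j]) ** 2
                         for i in range(4) for j in range(4)))


# Consecutive estimates (frames 1->2 and 2->3).
T_12 = make_pose(0.10, 1.0, 0.0, 0.0)
T_23 = make_pose(0.20, 0.5, 0.0, 0.0)

# A non-consecutive estimate (frames 1->3) that agrees with the chain ...
T_13_consistent = matmul(T_23, T_12)
# ... and one that drifts slightly in translation.
T_13_drifted = [row[:] for row in T_13_consistent]
T_13_drifted[0][3] += 0.3

consistent_residual = pose_loop_residual(T_12, T_23, T_13_consistent)
drifted_residual = pose_loop_residual(T_12, T_23, T_13_drifted)
print(consistent_residual, drifted_residual)
```

In a training setup, a residual of this kind would be computed on network-predicted poses and added to the objective, so that the pose predicted across a large frame gap stays consistent with the chain of small-gap predictions. Whether the paper measures the discrepancy in this way is not stated in the abstract.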
