Unsupervised Learning of Depth and Pose Estimation based on Continuous Frame Window

We present an unsupervised learning framework for the task of monocular depth and camera motion estimation from video sequences. In common with recent work, we use an unsupervised end-to-end learning method, requiring monocular video sequences for training. What makes the difference is, our approach not only uses image reconstruction as the supervisory signal but also exploits the pose estimation method which was used in traditional SLAM approach to enhance the supervisory signal and add training constraints. In pose estimation, a continuous frame window is set to construct the pose graph. Our method uses single-view depth and multi-view pose networks, with a loss based on reconstructing nearby images to the target using the predicted depth and pose. During training, the networks are thus coupled by the loss but can be applied independently at test time. Our evaluation of experiments on the KITTI dataset proves the effectiveness of our method: 1) monocular depth performs superior to the supervised methods that use ground-truth depth data for training and the existing unsupervised learning method. Our method performs comparably with the supervised methods that use ground-truth pose data for training. 2) pose estimation performs almost the same compared to established SLAM systems under comparable input settings.

[1]  Manolis I. A. Lourakis A Brief Description of the Levenberg-Marquardt Algorithm Implemented by levmar , 2005 .

[2]  Nassir Navab,et al.  Deeper Depth Prediction with Fully Convolutional Residual Networks , 2016, 2016 Fourth International Conference on 3D Vision (3DV).

[3]  Noah Snavely,et al.  Unsupervised Learning of Depth and Ego-Motion from Video , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[4]  Rob Fergus,et al.  Depth Map Prediction from a Single Image using a Multi-Scale Deep Network , 2014, NIPS.

[5]  Oisin Mac Aodha,et al.  Unsupervised Monocular Depth Estimation with Left-Right Consistency , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[6]  Gustavo Carneiro,et al.  Unsupervised CNN for Single View Depth Estimation: Geometry to the Rescue , 2016, ECCV.

[7]  Martín Abadi,et al.  TensorFlow: Large-Scale Machine Learning on Heterogeneous Distributed Systems , 2016, ArXiv.

[8]  Thomas Brox,et al.  FlowNet: Learning Optical Flow with Convolutional Networks , 2015, 2015 IEEE International Conference on Computer Vision (ICCV).

[9]  Xuming He,et al.  Discrete-Continuous Depth Estimation from a Single Image , 2014, 2014 IEEE Conference on Computer Vision and Pattern Recognition.

[10]  Stephen Gould,et al.  Single image depth estimation from predicted semantic labels , 2010, 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition.

[11]  Michael J. Black,et al.  Secrets of optical flow estimation and their principles , 2010, 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition.

[12]  Sing Bing Kang,et al.  Depth Transfer: Depth Extraction from Videos Using Nonparametric Sampling , 2016 .

[13]  Viorica Patraucean,et al.  Spatio-temporal video autoencoder with differentiable memory , 2015, ArXiv.

[14]  Ashutosh Saxena,et al.  3-D Depth Reconstruction from a Single Still Image , 2007, International Journal of Computer Vision.

[15]  Sebastian Ramos,et al.  The Cityscapes Dataset for Semantic Urban Scene Understanding , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[16]  Andreas Geiger,et al.  Are we ready for autonomous driving? The KITTI vision benchmark suite , 2012, 2012 IEEE Conference on Computer Vision and Pattern Recognition.

[17]  Ashutosh Saxena,et al.  Learning Depth from Single Monocular Images , 2005, NIPS.

[18]  J. M. M. Montiel,et al.  ORB-SLAM: A Versatile and Accurate Monocular SLAM System , 2015, IEEE Transactions on Robotics.

[19]  Andrew W. Fitzgibbon,et al.  Bundle Adjustment - A Modern Synthesis , 1999, Workshop on Vision Algorithms.

[20]  Konstantinos G. Derpanis,et al.  Back to Basics: Unsupervised Learning of Optical Flow via Brightness Constancy and Motion Smoothness , 2016, ECCV Workshops.

[21]  Alexei A. Efros,et al.  Automatic photo pop-up , 2005, SIGGRAPH 2005.

[22]  Ian D. Reid,et al.  Learning Depth from Single Monocular Images Using Deep Convolutional Neural Fields , 2015, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[23]  Matthijs C. Dorst Distinctive Image Features from Scale-Invariant Keypoints , 2011 .