End-to-end, sequence-to-sequence probabilistic visual odometry through deep neural networks

This paper studies visual odometry (VO) from the perspective of deep learning. After tremendous efforts in the robotics and computer vision communities over the past few decades, state-of-the-art VO algorithms have demonstrated incredible performance. However, since the VO problem is typically formulated as a pure geometric problem, one of the key features still missing from current VO systems is the capability to automatically gain knowledge and improve performance through learning. In this paper, we investigate whether deep neural networks can be effective and beneficial to the VO problem. An end-to-end, sequence-to-sequence probabilistic visual odometry (ESP-VO) framework is proposed for the monocular VO based on deep recurrent convolutional neural networks. It is trained and deployed in an end-to-end manner, that is, directly inferring poses and uncertainties from a sequence of raw images (video) without adopting any modules from the conventional VO pipeline. It can not only automatically learn effective feature representation encapsulating geometric information through convolutional neural networks, but also implicitly model sequential dynamics and relation for VO using deep recurrent neural networks. Uncertainty is also derived along with the VO estimation without introducing much extra computation. Extensive experiments on several datasets representing driving, flying and walking scenarios show competitive performance of the proposed ESP-VO to the state-of-the-art methods, demonstrating a promising potential of the deep learning technique for VO and verifying that it can be a viable complement to current VO systems.

[1]  Julius Ziegler,et al.  StereoScan: Dense 3d reconstruction in real-time , 2011, 2011 IEEE Intelligent Vehicles Symposium (IV).

[2]  Roberto Cipolla,et al.  PoseNet: A Convolutional Network for Real-Time 6-DOF Camera Relocalization , 2015, 2015 IEEE International Conference on Computer Vision (ICCV).

[3]  Roberto Cipolla,et al.  Bayesian SegNet: Model Uncertainty in Deep Convolutional Encoder-Decoder Architectures for Scene Understanding , 2015, BMVC.

[4]  Wolfram Burgard,et al.  Probabilistic Robotics (Intelligent Robotics and Autonomous Agents) , 2005 .

[5]  Hauke Strasdat,et al.  Scale Drift-Aware Large Scale Monocular SLAM , 2010, Robotics: Science and Systems.

[6]  Geoffrey E. Hinton,et al.  Deep Learning , 2015, Nature.

[7]  Wolfram Burgard,et al.  A benchmark for the evaluation of RGB-D SLAM systems , 2012, 2012 IEEE/RSJ International Conference on Intelligent Robots and Systems.

[8]  Ingmar Posner,et al.  Deep Tracking: Seeing Beyond Seeing Using Recurrent Neural Networks , 2016, AAAI.

[9]  James R. Bergen,et al.  Visual odometry , 2004, Proceedings of the 2004 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2004. CVPR 2004..

[10]  Francisco Angel Moreno,et al.  The Málaga urban dataset: High-rate stereo and LiDAR in a realistic urban scenario , 2014, Int. J. Robotics Res..

[11]  Ian D. Reid,et al.  On combining visual SLAM and visual odometry , 2010, 2010 IEEE International Conference on Robotics and Automation.

[12]  Viorica Patraucean,et al.  gvnn: Neural Network Library for Geometric Computer Vision , 2016, ECCV Workshops.

[13]  Jitendra Malik,et al.  Learning to See by Moving , 2015, 2015 IEEE International Conference on Computer Vision (ICCV).

[14]  Roberto Cipolla,et al.  Convolutional networks for real-time 6-DOF camera relocalization , 2015, ArXiv.

[15]  Paolo Valigi,et al.  Evaluation of non-geometric methods for visual odometry , 2014, Robotics Auton. Syst..

[16]  Thomas Brox,et al.  FlowNet 2.0: Evolution of Optical Flow Estimation with Deep Networks , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[17]  Konrad Schindler,et al.  Online Multi-Target Tracking Using Recurrent Neural Networks , 2016, AAAI.

[18]  Sen Wang,et al.  VidLoc: A Deep Spatio-Temporal Model for 6-DoF Video-Clip Relocalization , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[19]  Niko Sünderhauf,et al.  On the performance of ConvNet features for place recognition , 2015, 2015 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS).

[20]  Sebastian Ramos,et al.  The Cityscapes Dataset for Semantic Urban Scene Understanding , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[21]  Jürgen Schmidhuber,et al.  Long Short-Term Memory , 1997, Neural Computation.

[22]  Sergey Levine,et al.  Learning hand-eye coordination for robotic grasping with deep learning and large-scale data collection , 2016, Int. J. Robotics Res..

[23]  Dumitru Erhan,et al.  Going deeper with convolutions , 2014, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[24]  Peirong Ji StereoScan : Dense 3 D Reconstruction in Real-time , 2016 .

[25]  Daniel Cremers,et al.  Semi-dense Visual Odometry for a Monocular Camera , 2013, 2013 IEEE International Conference on Computer Vision.

[26]  Paolo Valigi,et al.  Exploring Representation Learning With CNNs for Frame-to-Frame Ego-Motion Estimation , 2016, IEEE Robotics and Automation Letters.

[27]  Jörg Stückler,et al.  Large-scale direct SLAM with stereo cameras , 2015, 2015 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS).

[28]  Andreas E. Savakis,et al.  Flowdometry: An Optical Flow and Deep Learning Based Approach to Visual Odometry , 2017, 2017 IEEE Winter Conference on Applications of Computer Vision (WACV).

[29]  Abhinav Gupta,et al.  Marr Revisited: 2D-3D Alignment via Surface Normal Prediction , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[30]  Thomas Brox,et al.  FlowNet: Learning Optical Flow with Convolutional Networks , 2015, 2015 IEEE International Conference on Computer Vision (ICCV).

[31]  Andrew J. Davison,et al.  DTAM: Dense tracking and mapping in real-time , 2011, 2011 International Conference on Computer Vision.

[32]  A. Markham,et al.  - Temporal Model for 6-DoF VideoClip Relocalization , 2022 .

[33]  Ian D. Reid,et al.  Learning Depth from Single Monocular Images Using Deep Convolutional Neural Fields , 2015, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[34]  Guigang Zhang,et al.  Deep Learning , 2016, Int. J. Semantic Comput..

[35]  C. Bishop Mixture density networks , 1994 .

[36]  Jimmy Ba,et al.  Adam: A Method for Stochastic Optimization , 2014, ICLR.

[37]  Roland Memisevic,et al.  Learning Visual Odometry with a Convolutional Network , 2015, VISAPP.

[38]  Wolfram Burgard,et al.  G2o: A general framework for graph optimization , 2011, 2011 IEEE International Conference on Robotics and Automation.

[39]  Thomas Brox,et al.  DeMoN: Depth and Motion Network for Learning Monocular Stereo , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[40]  Roberto Cipolla,et al.  Modelling uncertainty in deep learning for camera relocalization , 2015, 2016 IEEE International Conference on Robotics and Automation (ICRA).

[41]  Thia Kirubarajan,et al.  Estimation with Applications to Tracking and Navigation: Theory, Algorithms and Software , 2001 .

[42]  Andrew J. Davison,et al.  Active search for real-time vision , 2005, Tenth IEEE International Conference on Computer Vision (ICCV'05) Volume 1.

[43]  Rob Fergus,et al.  Predicting Depth, Surface Normals and Semantic Labels with a Common Multi-scale Convolutional Architecture , 2014, 2015 IEEE International Conference on Computer Vision (ICCV).

[44]  Alex Kendall,et al.  What Uncertainties Do We Need in Bayesian Deep Learning for Computer Vision? , 2017, NIPS.

[45]  Peter Cheeseman,et al.  On the Representation and Estimation of Spatial Uncertainty , 1986 .

[46]  Bernhard P. Wrobel,et al.  Multiple View Geometry in Computer Vision , 2001 .

[47]  W. Burgard,et al.  D Mapping with an RGB-D Camera , 2014 .

[48]  Honglak Lee,et al.  Deep learning for detecting robotic grasps , 2013, Int. J. Robotics Res..

[49]  Hauke Strasdat,et al.  Local accuracy and global consistency for efficient SLAM , 2012 .

[50]  Dimitrios G. Kottas,et al.  Camera-IMU-based localization: Observability analysis and consistency improvement , 2014, Int. J. Robotics Res..

[51]  Zoubin Ghahramani,et al.  Dropout as a Bayesian Approximation: Representing Model Uncertainty in Deep Learning , 2015, ICML.

[52]  Dorian Gálvez-López,et al.  Bags of Binary Words for Fast Place Recognition in Image Sequences , 2012, IEEE Transactions on Robotics.

[53]  Gustavo Carneiro,et al.  Unsupervised CNN for Single View Depth Estimation: Geometry to the Rescue , 2016, ECCV.

[54]  Trevor Darrell,et al.  Long-term recurrent convolutional networks for visual recognition and description , 2014, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[55]  Friedrich Fraundorfer,et al.  Visual Odometry Part I: The First 30 Years and Fundamentals , 2022 .

[56]  Roland Siegwart,et al.  The EuRoC micro aerial vehicle datasets , 2016, Int. J. Robotics Res..

[57]  Andreas Geiger,et al.  Vision meets robotics: The KITTI dataset , 2013, Int. J. Robotics Res..

[58]  Clark C. Guest,et al.  High Accuracy Monocular SFM and Scale Correction for Autonomous Driving , 2016, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[59]  Andrew Zisserman,et al.  Very Deep Convolutional Networks for Large-Scale Image Recognition , 2014, ICLR.

[60]  Sen Wang,et al.  DeepVO: Towards end-to-end visual odometry with deep Recurrent Convolutional Neural Networks , 2017, 2017 IEEE International Conference on Robotics and Automation (ICRA).

[61]  Roberto Cipolla,et al.  Geometric Loss Functions for Camera Pose Regression with Deep Learning , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[62]  J. M. M. Montiel,et al.  ORB-SLAM: A Versatile and Accurate Monocular SLAM System , 2015, IEEE Transactions on Robotics.

[63]  Fabio Ramos,et al.  Semi-parametric learning for visual odometry , 2013, Int. J. Robotics Res..

[64]  Daniel Cremers,et al.  Direct Sparse Odometry , 2016, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[65]  Davide Scaramuzza,et al.  SVO: Fast semi-direct monocular visual odometry , 2014, 2014 IEEE International Conference on Robotics and Automation (ICRA).

[66]  Esa Rahtu,et al.  Relative Camera Pose Estimation Using Convolutional Neural Networks , 2017, ACIVS.

[67]  Andreas Geiger,et al.  Are we ready for autonomous driving? The KITTI vision benchmark suite , 2012, 2012 IEEE Conference on Computer Vision and Pattern Recognition.

[68]  Tucker R. Balch,et al.  Memory-based learning for visual odometry , 2008, 2008 IEEE International Conference on Robotics and Automation.

[69]  Daniel Cremers,et al.  LSD-SLAM: Large-Scale Direct Monocular SLAM , 2014, ECCV.

[70]  Wolfram Burgard,et al.  3-D Mapping With an RGB-D Camera , 2014, IEEE Transactions on Robotics.

[71]  Hauke Strasdat,et al.  Visual SLAM: Why filter? , 2012, Image Vis. Comput..

[72]  Stefan Leutenegger,et al.  SemanticFusion: Dense 3D semantic mapping with convolutional neural networks , 2016, 2017 IEEE International Conference on Robotics and Automation (ICRA).

[73]  Derek Hoiem,et al.  Indoor Segmentation and Support Inference from RGBD Images , 2012, ECCV.

[74]  Navdeep Jaitly,et al.  Towards End-To-End Speech Recognition with Recurrent Neural Networks , 2014, ICML.

[75]  Chunping Li,et al.  Deep Convolutional Neural Network for 6-DOF Image Localization , 2016, ArXiv.

[76]  Sen Wang,et al.  VINet: Visual-Inertial Odometry as a Sequence-to-Sequence Learning Problem , 2017, AAAI.

[77]  F. Fraundorfer,et al.  Visual Odometry : Part II: Matching, Robustness, Optimization, and Applications , 2012, IEEE Robotics & Automation Magazine.

[78]  Olivier Stasse,et al.  MonoSLAM: Real-Time Single Camera SLAM , 2007, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[79]  G. Klein,et al.  Parallel Tracking and Mapping for Small AR Workspaces , 2007, 2007 6th IEEE and ACM International Symposium on Mixed and Augmented Reality.

[80]  Jitendra Malik,et al.  Generic 3D Representation via Pose Estimation and Matching , 2016, ECCV.

[81]  Michael Milford,et al.  Place Recognition with ConvNet Landmarks: Viewpoint-Robust, Condition-Robust, Training-Free , 2015, Robotics: Science and Systems.

[82]  Wolfram Burgard,et al.  Robust visual SLAM across seasons , 2015, 2015 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS).

[83]  Larry H. Matthies,et al.  Error modeling in stereo navigation , 1986, IEEE J. Robotics Autom..

[84]  Clark C. Guest,et al.  Parallel, real-time monocular visual odometry , 2013, 2013 IEEE International Conference on Robotics and Automation.