论文信息 - Learning Monocular Visual Odometry via Self-Supervised Long-Term Modeling

Learning Monocular Visual Odometry via Self-Supervised Long-Term Modeling

Monocular visual odometry (VO) suffers severely from error accumulation during frame-to-frame pose estimation. In this paper, we present a self-supervised learning method for VO with special consideration for consistency over longer sequences. To this end, we model the long-term dependency in pose prediction using a pose network that features a two-layer convolutional LSTM module. We train the networks with purely self-supervised losses, including a cycle consistency loss that mimics the loop closure module in geometric VO. Inspired by prior geometric systems, we allow the networks to see beyond a small temporal window during training, through a novel a loss that incorporates temporally distant (e.g., O(100)) frames. Given GPU memory constraints, we propose a stage-wise training mechanism, where the first stage operates in a local time window and the second stage refines the poses with a "global" loss given the first stage features. We demonstrate competitive results on several standard VO datasets, including KITTI and TUM RGB-D.

[1] Davide Scaramuzza,et al. SVO: Fast semi-direct monocular visual odometry , 2014, 2014 IEEE International Conference on Robotics and Automation (ICRA).

[2] Sen Wang,et al. DeepVO: Towards end-to-end visual odometry with deep Recurrent Convolutional Neural Networks , 2017, 2017 IEEE International Conference on Robotics and Automation (ICRA).

[3] J. M. M. Montiel,et al. ORB-SLAM: A Versatile and Accurate Monocular SLAM System , 2015, IEEE Transactions on Robotics.

[4] Jia Deng,et al. DeepV2D: Video to Depth with Differentiable Structure from Motion , 2018, ICLR.

[5] Wolfram Burgard,et al. A benchmark for the evaluation of RGB-D SLAM systems , 2012, 2012 IEEE/RSJ International Conference on Intelligent Robots and Systems.

[6] John J. Leonard,et al. Past, Present, and Future of Simultaneous Localization and Mapping: Toward the Robust-Perception Age , 2016, IEEE Transactions on Robotics.

[7] Juan D. Tardós,et al. ORB-SLAM2: An Open-Source SLAM System for Monocular, Stereo, and RGB-D Cameras , 2016, IEEE Transactions on Robotics.

[8] Navdeep Jaitly,et al. Towards End-To-End Speech Recognition with Recurrent Neural Networks , 2014, ICML.

[9] Friedrich Fraundorfer,et al. Visual Odometry Part I: The First 30 Years and Fundamentals , 2022 .

[10] R. Hartley. Triangulation, Computer Vision and Image Understanding , 1997 .

[11] Noah Snavely,et al. Unsupervised Learning of Depth and Ego-Motion from Video , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[12] Hongbin Zha,et al. Beyond Tracking: Selecting Memory and Refining Poses for Deep Visual Odometry , 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[13] Yang Li,et al. Pose Graph optimization for Unsupervised Monocular Visual Odometry , 2019, 2019 International Conference on Robotics and Automation (ICRA).

[14] Lei Zhou,et al. Beyond Photometric Loss for Self-Supervised Ego-Motion Estimation , 2019, 2019 International Conference on Robotics and Automation (ICRA).

[15] Andreas Geiger,et al. Vision meets robotics: The KITTI dataset , 2013, Int. J. Robotics Res..

[16] Bingbing Zhuang,et al. Pseudo RGB-D for Self-Improving Monocular SLAM and Depth Prediction , 2020, ECCV.

[17] Andrew W. Fitzgibbon,et al. Bundle Adjustment - A Modern Synthesis , 1999, Workshop on Vision Algorithms.

[18] Ian D. Reid,et al. Unsupervised Learning of Monocular Depth Estimation and Visual Odometry with Deep Feature Reconstruction , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[19] Dan Xu,et al. Unsupervised Collaborative Learning of Keyframe Detection and Visual Odometry Towards Monocular Deep SLAM , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[20] Jason J. Corso,et al. A Continuous Occlusion Model for Road Scene Understanding , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[21] Ruben Villegas,et al. Learning to Generate Long-term Future via Hierarchical Prediction , 2017, ICML.

[22] Bingbing Zhuang,et al. Learning Structure-And-Motion-Aware Rolling Shutter Correction , 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[23] Daniel Cremers,et al. Direct Sparse Odometry , 2016, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[24] G. Klein,et al. Parallel Tracking and Mapping for Small AR Workspaces , 2007, 2007 6th IEEE and ACM International Symposium on Mixed and Augmented Reality.

[25] Daniel Cremers,et al. Challenges in Monocular Visual Odometry: Photometric Calibration, Motion Bias, and Rolling Shutter Effect , 2017, IEEE Robotics and Automation Letters.

[26] Jimmy Ba,et al. Adam: A Method for Stochastic Optimization , 2014, ICLR.

[27] Zhichao Yin,et al. GeoNet: Unsupervised Learning of Dense Depth, Optical Flow and Camera Pose , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[28] Stefan Leutenegger,et al. CodeSLAM - Learning a Compact, Optimisable Representation for Dense Visual SLAM , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[29] Wolfram Burgard,et al. G2o: A general framework for graph optimization , 2011, 2011 IEEE International Conference on Robotics and Automation.

[30] Sen Wang,et al. End-to-end, sequence-to-sequence probabilistic visual odometry through deep neural networks , 2018, Int. J. Robotics Res..

[31] Jan-Michael Frahm,et al. Recurrent Neural Network for (Un-)Supervised Learning of Monocular Video Visual Odometry and Depth , 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[32] Yoshua Bengio,et al. End-to-end Continuous Speech Recognition using Attention-based Recurrent NN: First Results , 2014, ArXiv.

[33] Simon Lucey,et al. Learning Depth from Monocular Videos Using Direct Methods , 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[34] Thomas Brox,et al. DeepTAM: Deep Tracking and Mapping , 2018, ECCV.

[35] Julius Ziegler,et al. StereoScan: Dense 3d reconstruction in real-time , 2011, 2011 IEEE Intelligent Vehicles Symposium (IV).

[36] James R. Bergen,et al. Visual odometry , 2004, Proceedings of the 2004 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2004. CVPR 2004..

[37] Shiyu Song,et al. Robust Scale Estimation in Real-Time Monocular SFM for Autonomous Driving , 2014, 2014 IEEE Conference on Computer Vision and Pattern Recognition.

[38] Thomas Brox,et al. DeMoN: Depth and Motion Network for Learning Monocular Stereo , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[39] Dit-Yan Yeung,et al. Convolutional LSTM Network: A Machine Learning Approach for Precipitation Nowcasting , 2015, NIPS.

[40] Andreas Geiger,et al. Are we ready for autonomous driving? The KITTI vision benchmark suite , 2012, 2012 IEEE Conference on Computer Vision and Pattern Recognition.

[41] Chunhua Shen,et al. Unsupervised Scale-consistent Depth and Ego-motion Learning from Monocular Video , 2019, NeurIPS.

[42] Jörg Stückler,et al. Direct Sparse Odometry with Rolling Shutter , 2018, ECCV.

[43] Thomas Brox,et al. FlowNet: Learning Optical Flow with Convolutional Networks , 2015, 2015 IEEE International Conference on Computer Vision (ICCV).

[44] Jia-Bin Huang,et al. DF-Net: Unsupervised Joint Learning of Depth and Flow using Cross-Task Consistency , 2018, ECCV.

[45] Ping Tan,et al. BA-Net: Dense Bundle Adjustment Network , 2018, ICLR 2018.

[46] Anelia Angelova,et al. Unsupervised Learning of Depth and Ego-Motion from Monocular Video Using 3D Geometric Constraints , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[47] Gabriel J. Brostow,et al. Digging Into Self-Supervised Monocular Depth Estimation , 2018, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[48] Dongbing Gu,et al. UnDeepVO: Monocular Visual Odometry Through Unsupervised Deep Learning , 2017, 2018 IEEE International Conference on Robotics and Automation (ICRA).

[49] Alexandru Tupan,et al. Triangulation , 1997, Comput. Vis. Image Underst..

[50] Michael J. Black,et al. Competitive Collaboration: Joint Unsupervised Learning of Depth, Camera Motion, Optical Flow and Motion Segmentation , 2018, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[51] Loong Fah Cheong,et al. Degeneracy in Self-Calibration Revisited and a Deep Learning Solution for Uncalibrated SLAM , 2019, 2019 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS).

[52] Daniel Cremers,et al. Stereo DSO: Large-Scale Direct Sparse Visual Odometry with Stereo Cameras , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[53] Hongbin Zha,et al. Guided Feature Selection for Deep Visual Odometry , 2018, ACCV.

[54] Nitish Srivastava,et al. Unsupervised Learning of Video Representations using LSTMs , 2015, ICML.

[55] Daniel Cremers,et al. LSD-SLAM: Large-Scale Direct Monocular SLAM , 2014, ECCV.

[56] Jean Charles Bazin,et al. DeepCalib: a deep learning approach for automatic intrinsic calibration of wide field-of-view cameras , 2018, CVMP '18.