Incorporating learnt local and global embeddings into monocular visual SLAM

Huaiyang Huang (hhuangat@connect.ust.hk), Haoyang Ye (hy.ye@connect.ust.hk), Lujia Wang (eewanglj@ust.hk), and Ming Liu (corresponding author, eelium@ust.hk), Department of Electronic and Computer Engineering, The Hong Kong University of Science and Technology, Hong Kong SAR, China · Yuxiang Sun (yx.sun@polyu.edu.hk, sun.yuxiang@outlook.com), Department of Mechanical Engineering, The Hong Kong Polytechnic University, Hung Hom, Hong Kong SAR, China

Abstract: Traditional approaches to Visual Simultaneous Localization and Mapping (VSLAM) rely on low-level visual information for state estimation, such as handcrafted local features or image gradients. While significant progress has been made along this track, the performance of state-of-the-art systems generally degrades under more challenging conditions for monocular VSLAM, e.g., varying illumination. Consequently, the robustness and accuracy of monocular VSLAM remain open concerns. This paper presents a monocular VSLAM system that fully exploits learnt features for better state estimation. The proposed system leverages both learnt local features and global embeddings in different modules of the system: direct camera pose estimation, inter-frame feature association, and loop closure detection. With a probabilistic interpretation of keypoint prediction, we formulate camera pose tracking in a direct manner and parameterize local features with their uncertainty taken into account. To alleviate the quantization effect, we adapt the mapping module to generate 3D landmarks more reliably, which helps guarantee the system's robustness. Detecting temporal loop closures via deep global embeddings further improves the robustness and accuracy of the proposed system. The proposed system is extensively evaluated on public datasets (Tsukuba, EuRoC, and KITTI) and compared against state-of-the-art methods. The competitive performance of camera pose estimation confirms the effectiveness of our method.

Acknowledgments: This work was supported by the National Natural Science Foundation of China under grant No. U1713211, the Collaborative Research Fund of the Research Grants Council Hong Kong under Project No. C4063-18G, and the HKUST-SJTU Joint Research Collaboration Fund under project SJTU20EG03, awarded to Prof. Ming Liu.
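As a rough illustration of two of the components described above (the probabilistic treatment of learned keypoint predictions, and loop closure detection via global embeddings), the NumPy sketch below shows one plausible realization. It is a hedged sketch under our own assumptions, not the authors' implementation: the names keypoint_covariance, whitened_residual, and is_loop_closure, as well as the 0.85 similarity threshold, are hypothetical.

    import numpy as np

    def keypoint_covariance(score_patch):
        # Interpret a local detector-score patch as an unnormalized
        # probability distribution over the sub-pixel keypoint location;
        # return its mean and 2x2 covariance (the feature uncertainty).
        h, w = score_patch.shape
        ys, xs = np.mgrid[0:h, 0:w]
        p = score_patch / score_patch.sum()
        mean = np.array([(xs * p).sum(), (ys * p).sum()])
        dx, dy = xs - mean[0], ys - mean[1]
        cov = np.array([[(dx * dx * p).sum(), (dx * dy * p).sum()],
                        [(dx * dy * p).sum(), (dy * dy * p).sum()]])
        return mean, cov

    def whitened_residual(pred_px, obs_px, cov):
        # Mahalanobis-whitened 2D residual for an alignment objective:
        # peaked score distributions (small covariance) receive larger
        # weight, diffuse ones are down-weighted.
        info = np.linalg.inv(cov + 1e-9 * np.eye(2))
        L = np.linalg.cholesky(info)
        return L.T @ (np.asarray(pred_px) - np.asarray(obs_px))

    def is_loop_closure(query_emb, db_embs, thresh=0.85):
        # Score a temporal loop-closure candidate by cosine similarity
        # between L2-normalized global image embeddings.
        sims = db_embs @ query_emb
        best = int(np.argmax(sims))
        return (best, float(sims[best])) if sims[best] > thresh else None

    # Toy usage: a peaked 7x7 score patch and random unit embeddings.
    ys, xs = np.mgrid[0:7, 0:7]
    patch = np.exp(-((xs - 3.0) ** 2 + (ys - 3.0) ** 2) / 2.0)
    mu, sigma = keypoint_covariance(patch)
    r = whitened_residual([3.2, 3.1], mu, sigma)
    db = np.random.randn(5, 128)
    db /= np.linalg.norm(db, axis=1, keepdims=True)
    print(mu, np.diag(sigma), r, is_loop_closure(db[0], db))

The whitening step is what carries the keypoint uncertainty into the estimator: normal equations built from such residuals are equivalent to a covariance-weighted least-squares problem, so confident keypoints dominate the pose update.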
