Deep Patch Visual Odometry

We propose Deep Patch Visual Odometry (DPVO), a new deep learning system for monocular visual odometry (VO). DPVO uses a novel recurrent network architecture designed for tracking image patches across time. Recent approaches to VO have significantly improved state-of-the-art accuracy by using deep networks to predict dense flow between video frames. However, dense flow incurs a large computational cost, making these methods impractical for many use cases. Despite this, dense flow has been assumed to be important because it provides redundancy against incorrect matches. DPVO disproves this assumption, showing that the best accuracy and efficiency can be achieved by exploiting the advantages of sparse patch-based matching over dense flow. DPVO introduces a novel recurrent update operator for patch-based correspondence, coupled with differentiable bundle adjustment. On standard benchmarks, DPVO outperforms all prior work, including the learning-based state-of-the-art VO system DROID, while using a third of the memory and running 3x faster on average. Code is available at https://github.com/princeton-vl/DPVO.
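To illustrate the core idea of sparse patch-based matching, here is a minimal NumPy sketch. It is not DPVO's method: DPVO refines patch trajectories with a learned recurrent update operator and differentiable bundle adjustment, whereas this toy replaces both with an exhaustive local correlation search. All function names (`extract_patch`, `track_patch`) and the synthetic frames are hypothetical, for illustration only.

```python
import numpy as np

def extract_patch(img, cy, cx, r=3):
    """Crop a (2r+1) x (2r+1) patch centered at (cy, cx)."""
    return img[cy - r:cy + r + 1, cx - r:cx + r + 1]

def track_patch(img0, img1, cy, cx, r=3, search=5):
    """Find the displacement (dy, dx) in img1 best matching the patch
    from img0, by brute-force search over a small window. DPVO instead
    predicts this update with a learned recurrent operator."""
    template = extract_patch(img0, cy, cx, r)
    best_score, best_dy, best_dx = -np.inf, 0, 0
    for dy in range(-search, search + 1):
        for dx in range(-search, search + 1):
            cand = extract_patch(img1, cy + dy, cx + dx, r)
            score = -np.sum((template - cand) ** 2)  # negative SSD
            if score > best_score:
                best_score, best_dy, best_dx = score, dy, dx
    return best_dy, best_dx

# Toy demo: frame1 is frame0 shifted by (2, -1) pixels.
rng = np.random.default_rng(0)
frame0 = rng.standard_normal((64, 64))
frame1 = np.roll(frame0, shift=(2, -1), axis=(0, 1))

centers = [(20, 20), (32, 40), (45, 15)]  # sparse patch centers
flows = [track_patch(frame0, frame1, cy, cx) for cy, cx in centers]
print(flows)  # each recovered flow is (2, -1)
```

The sketch shows why sparse patches are cheap: only a handful of small windows are matched per frame, rather than a dense flow field over every pixel. In DPVO, the per-patch motion estimates are additionally coupled through bundle adjustment to recover camera poses and patch depths.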
