To Learn or Not to Learn: Visual Localization from Essential Matrices

Visual localization is the problem of estimating a camera within a scene and a key technology for autonomous robots. State-of-the-art approaches for accurate visual localization use scene-specific representations, resulting in the overhead of constructing these models when applying the techniques to new scenes. Recently, learned approaches based on relative pose estimation have been proposed, carrying the promise of easily adapting to new scenes. However, they are currently significantly less accurate than state-of-the-art approaches. In this paper, we are interested in analyzing this behavior. To this end, we propose a novel framework for visual localization from relative poses. Using a classical feature-based approach within this framework, we show state-of-the-art performance. Replacing the classical approach with learned alternatives at various levels, we then identify the reasons for why deep learned approaches do not perform well. Based on our analysis, we make recommendations for future work.

[1]  Changchang Wu,et al.  Towards Linear-Time Incremental Structure from Motion , 2013, 2013 International Conference on 3D Vision.

[2]  Vladlen Koltun,et al.  Deep Fundamental Matrix Estimation , 2018, ECCV.

[3]  Roberto Cipolla,et al.  PoseNet: A Convolutional Network for Real-Time 6-DOF Camera Relocalization , 2015, 2015 IEEE International Conference on Computer Vision (ICCV).

[4]  Torsten Sattler,et al.  Hybrid Scene Compression for Visual Localization , 2018, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[5]  Desire Sidibé,et al.  Learning Scene Geometry for Visual Localization in Challenging Conditions , 2019, 2019 International Conference on Robotics and Automation (ICRA).

[6]  Torsten Sattler,et al.  Are Large-Scale 3D Models Really Necessary for Accurate Visual Localization? , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[7]  Roberto Cipolla,et al.  Geometric Loss Functions for Camera Pose Regression with Deep Learning , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[8]  Fei Wang,et al.  Eigendecomposition-free Training of Deep Networks with Zero Eigenvalue-based Losses , 2018, ECCV.

[9]  J. M. M. Montiel,et al.  ORB-SLAM: A Versatile and Accurate Monocular SLAM System , 2015, IEEE Transactions on Robotics.

[10]  G LoweDavid,et al.  Distinctive Image Features from Scale-Invariant Keypoints , 2004 .

[11]  Wolfram Burgard,et al.  Deep Auxiliary Learning for Visual Localization and Odometry , 2018, 2018 IEEE International Conference on Robotics and Automation (ICRA).

[12]  Esa Rahtu,et al.  Relative Camera Pose Estimation Using Convolutional Neural Networks , 2017, ACIVS.

[13]  P. J. Narayanan,et al.  Visibility Probability Structure from SfM Datasets and Applications , 2012, ECCV.

[14]  Dieter Schmalstieg,et al.  Wide area localization on mobile phones , 2009, 2009 8th IEEE International Symposium on Mixed and Augmented Reality.

[15]  James J. Little,et al.  Backtracking regression forests for accurate camera relocalization , 2017, 2017 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS).

[16]  Wolfram Burgard,et al.  Deep regression for monocular camera-based 6-DoF global localization in outdoor environments , 2017, 2017 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS).

[17]  Torsten Sattler,et al.  Understanding the Limitations of CNN-Based Absolute Camera Pose Regression , 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[18]  Tomás Pajdla,et al.  Neighbourhood Consensus Networks , 2018, NeurIPS.

[19]  Esa Rahtu,et al.  Image-Based Localization Using Hourglass Networks , 2017, 2017 IEEE International Conference on Computer Vision Workshops (ICCVW).

[20]  Noah Snavely,et al.  Minimal Scene Descriptions from Structure from Motion Models , 2014, 2014 IEEE Conference on Computer Vision and Pattern Recognition.

[21]  Michael Bosse,et al.  Get Out of My Lab: Large-scale, Real-Time Visual-Inertial Localization , 2015, Robotics: Science and Systems.

[22]  Michael F. Cohen,et al.  Real-time image-based 6-DOF localization in large-scale environments , 2012, 2012 IEEE Conference on Computer Vision and Pattern Recognition.

[23]  Tomás Pajdla,et al.  Visual localization by linear combination of image descriptors , 2011, 2011 IEEE International Conference on Computer Vision Workshops (ICCV Workshops).

[24]  Torsten Sattler,et al.  Semantic Visual Localization , 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[25]  Daniel P. Huttenlocher,et al.  Location Recognition Using Prioritized Feature Matching , 2010, ECCV.

[26]  Andrew W. Fitzgibbon,et al.  Exploiting uncertainty in regression forests for accurate camera relocalization , 2015, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[27]  Roland Siegwart,et al.  A novel parametrization of the perspective-three-point problem for a direct computation of absolute camera position and orientation , 2011, CVPR 2011.

[28]  Thomas Brox,et al.  DeMoN: Depth and Motion Network for Learning Monocular Stereo , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[29]  Michael S. Bernstein,et al.  ImageNet Large Scale Visual Recognition Challenge , 2014, International Journal of Computer Vision.

[30]  Torsten Sattler,et al.  Benchmarking 6DOF Outdoor Visual Localization in Changing Conditions , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[31]  Vincent Lepetit,et al.  Learning to Find Good Correspondences , 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[32]  Jan Kautz,et al.  Geometry-Aware Learning of Maps for Camera Localization , 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[33]  Eric Brachmann,et al.  Learning Less is More - 6D Camera Localization via 3D Surface Regression , 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[34]  Mubarak Shah,et al.  Image Geo-Localization Based on MultipleNearest Neighbor Feature Matching UsingGeneralized Graphs , 2014, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[35]  Cordelia Schmid,et al.  Hamming Embedding and Weak Geometric Consistency for Large Scale Image Search , 2008, ECCV.

[36]  Eric Brachmann,et al.  DSAC — Differentiable RANSAC for Camera Localization , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[37]  Luca Antiga,et al.  Automatic differentiation in PyTorch , 2017 .

[38]  Josef Sivic,et al.  Convolutional Neural Network Architecture for Geometric Matching , 2019, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[39]  Ilya Kostrikov,et al.  PlaNet - Photo Geolocation with Convolutional Neural Networks , 2016, ECCV.

[40]  Torsten Sattler,et al.  Efficient 2D-3D Matching for Multi-Camera Visual Localization , 2018, 2019 International Conference on Robotics and Automation (ICRA).

[41]  Tomás Pajdla,et al.  NetVLAD: CNN Architecture for Weakly Supervised Place Recognition , 2018, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[42]  Li Fei-Fei,et al.  ImageNet: A large-scale hierarchical image database , 2009, CVPR.

[43]  Torsten Sattler,et al.  Efficient & Effective Prioritized Matching for Large-Scale Image-Based Localization , 2017, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[44]  Wei Zhang,et al.  Image Based Localization in Urban Environments , 2006, Third International Symposium on 3D Data Processing, Visualization, and Transmission (3DPVT'06).

[45]  Dieter Schmalstieg,et al.  Discriminative Feature-to-Point Matching in Image-Based Localization , 2014, 2014 IEEE Conference on Computer Vision and Pattern Recognition.

[46]  Masatoshi Okutomi,et al.  24/7 Place Recognition by View Synthesis , 2018, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[47]  Jiri Matas,et al.  Fixing the Locally Optimized RANSAC , 2012, BMVC.

[48]  Roland Siegwart,et al.  Leveraging Deep Visual Descriptors for Hierarchical Efficient Localization , 2018, CoRL.

[49]  Bernhard P. Wrobel,et al.  Multiple View Geometry in Computer Vision , 2001 .

[50]  David W. Murray,et al.  Video-rate localization in multiple maps for wearable augmented reality , 2008, 2008 12th IEEE International Symposium on Wearable Computers.

[51]  Andrew W. Fitzgibbon,et al.  Scene Coordinate Regression Forests for Camera Relocalization in RGB-D Images , 2013, 2013 IEEE Conference on Computer Vision and Pattern Recognition.

[52]  Eric Brachmann,et al.  Uncertainty-Driven 6D Pose Estimation of Objects and Scenes from a Single RGB Image , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[53]  Torsten Sattler,et al.  D2-Net: A Trainable CNN for Joint Detection and Description of Local Features , 2019, CVPR 2019.

[54]  Andrew Zisserman,et al.  DisLocation: Scalable Descriptor Distinctiveness for Location Recognition , 2014, ACCV.

[55]  Tomasz Malisiewicz,et al.  SuperPoint: Self-Supervised Interest Point Detection and Description , 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW).

[56]  Sen Wang,et al.  VidLoc: A Deep Spatio-Temporal Model for 6-DoF Video-Clip Relocalization , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[57]  Matthias Nießner,et al.  ScanNet: Richly-Annotated 3D Reconstructions of Indoor Scenes , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[58]  Jimmy Ba,et al.  Adam: A Method for Stochastic Optimization , 2014, ICLR.

[59]  Shuda Li,et al.  RelocNet: Continuous Metric Learning Relocalisation Using Neural Nets , 2018, ECCV.

[60]  Jan-Michael Frahm,et al.  From structure-from-motion point clouds to fast location recognition , 2009, 2009 IEEE Conference on Computer Vision and Pattern Recognition.

[61]  Daniel Cremers,et al.  Image-Based Localization Using LSTMs for Structured Feature Correlation , 2016, 2017 IEEE International Conference on Computer Vision (ICCV).

[62]  Jian Sun,et al.  Deep Residual Learning for Image Recognition , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[63]  Juho Kannala,et al.  Camera Relocalization by Computing Pairwise Relative Poses Using Convolutional Neural Network , 2017, 2017 IEEE International Conference on Computer Vision Workshops (ICCVW).

[64]  Torsten Sattler,et al.  Camera Pose Voting for Large-Scale Image-Based Localization , 2015, 2015 IEEE International Conference on Computer Vision (ICCV).

[65]  Michael Isard,et al.  Object retrieval with large vocabularies and fast spatial matching , 2007, 2007 IEEE Conference on Computer Vision and Pattern Recognition.

[66]  Luigi di Stefano,et al.  On-the-Fly Adaptation of Regression Forests for Online Camera Relocalisation , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[67]  Torsten Sattler,et al.  InLoc: Indoor Visual Localization with Dense Matching and View Synthesis , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[68]  Noah Snavely,et al.  Unsupervised Learning of Depth and Ego-Motion from Video , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[69]  Andrew Zisserman,et al.  Video Google: a text retrieval approach to object matching in videos , 2003, Proceedings Ninth IEEE International Conference on Computer Vision.

[70]  Torsten Sattler,et al.  Hyperpoints and Fine Vocabularies for Large-Scale Location Recognition , 2015, 2015 IEEE International Conference on Computer Vision (ICCV).

[71]  Jan-Michael Frahm,et al.  Structure-from-Motion Revisited , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[72]  Wolfram Burgard,et al.  VLocNet++: Deep Multitask Learning for Semantic Visual Localization and Odometry , 2018, IEEE Robotics and Automation Letters.

[73]  Roland Siegwart,et al.  From Coarse to Fine: Robust Hierarchical Localization at Large Scale , 2018, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[74]  Masatoshi Okutomi,et al.  Visual Place Recognition with Repetitive Structures , 2015, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[75]  David Nistér,et al.  An efficient solution to the five-point relative pose problem , 2004, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[76]  Robert C. Bolles,et al.  Random sample consensus: a paradigm for model fitting with applications to image analysis and automated cartography , 1981, CACM.

[77]  Andrew W. Fitzgibbon,et al.  Multi-output Learning for Camera Relocalization , 2014, 2014 IEEE Conference on Computer Vision and Pattern Recognition.

[78]  C. V. Jawahar,et al.  Improved Visual Relocalization by Discovering Anchor Points , 2018, BMVC.

[79]  Zhengqi Li,et al.  MegaDepth: Learning Single-View Depth Prediction from Internet Photos , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[80]  Fredrik Kahl,et al.  City-Scale Localization for Cameras with Known Vertical Direction , 2017, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[81]  Roberto Cipolla,et al.  Modelling uncertainty in deep learning for camera relocalization , 2015, 2016 IEEE International Conference on Robotics and Automation (ICRA).

[82]  Ondrej Chum,et al.  CNN Image Retrieval Learns from BoW: Unsupervised Fine-Tuning with Hard Examples , 2016, ECCV.

[83]  Daniel Cremers,et al.  Direct Sparse Odometry , 2016, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[84]  Gabriela Csurka,et al.  Visual Localization by Learning Objects-Of-Interest Dense Match Regression , 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[85]  Jian Sun,et al.  Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification , 2015, 2015 IEEE International Conference on Computer Vision (ICCV).

[86]  Giorgos Tolias,et al.  Fine-Tuning CNN Image Retrieval with No Human Annotation , 2017, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[87]  Torsten Sattler,et al.  Semantic Match Consistency for Long-Term Visual Localization , 2018, ECCV.

[88]  Mubarak Shah,et al.  Accurate Image Localization Based on Google Maps Street View , 2010, ECCV.

[89]  Pascal Fua,et al.  Worldwide Pose Estimation Using 3D Point Clouds , 2012, ECCV.

[90]  Noah Snavely,et al.  Graph-Based Discriminative Learning for Location Recognition , 2013, International Journal of Computer Vision.