On the Limits of Pseudo Ground Truth in Visual Camera Re-localisation

Benchmark datasets that measure camera pose accuracy have driven progress in visual re-localisation research. To obtain poses for thousands of images, it is common to use a reference algorithm to generate pseudo ground truth. Popular choices include Structure-from-Motion (SfM) and Simultaneous Localisation and Mapping (SLAM), using additional sensors such as depth cameras where available. Re-localisation benchmarks thus measure how well each method replicates the results of the reference algorithm. This raises the question of whether the choice of reference algorithm favours a certain family of re-localisation methods. This paper analyses two widely used re-localisation datasets and shows that evaluation outcomes indeed vary with the choice of reference algorithm. We thus question two common beliefs in the re-localisation literature: that learning-based scene coordinate regression outperforms classical feature-based methods, and that RGB-D-based methods outperform RGB-based methods. We argue that any claim about the ranking of re-localisation methods should take into account the type of reference algorithm and how similar each method is to it.
