SACReg: Scene-Agnostic Coordinate Regression for Visual Localization

Scene coordinate regression (SCR), i.e., predicting 3D coordinates for every pixel of a given image, has recently shown promising potential. However, existing methods remain mostly scene-specific or limited to small scenes and thus hardly scale to realistic datasets. In this paper, we propose a new paradigm in which a single generic SCR model is trained once and then deployed to new test scenes, regardless of their scale and without further finetuning. For a given query image, it collects inputs from off-the-shelf image retrieval techniques and Structure-from-Motion databases: a list of relevant database images with sparse pointwise 2D-3D annotations. The model is based on the transformer architecture and can take a variable number of images and sparse 2D-3D annotations as input. It is trained on a few diverse datasets and, for visual localization, significantly outperforms other scene regression approaches, including scene-specific models, on several benchmarks. In particular, we set a new state of the art on the Cambridge localization benchmark, even outperforming feature-matching-based approaches.
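As a rough illustration of the input-gathering step described above, the sketch below ranks database images against a query and packages the top matches with their sparse 2D-3D annotations. It is not the paper's implementation: the `DatabaseImage` container, the use of cosine similarity over a global descriptor, and the function names are all assumptions made for illustration, and the regression model itself is not shown.

```python
from dataclasses import dataclass
from math import sqrt
from typing import List, Sequence, Tuple

Point2D = Tuple[float, float]
Point3D = Tuple[float, float, float]


@dataclass
class DatabaseImage:
    """Hypothetical record for one mapped image of a scene."""
    name: str
    descriptor: Sequence[float]  # global retrieval descriptor (assumed)
    points_2d: List[Point2D]     # pixel positions of sparse keypoints
    points_3d: List[Point3D]     # matching 3D points in the scene frame


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sqrt(sum(x * x for x in a))
    norm_b = sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query_descriptor: Sequence[float],
             database: List[DatabaseImage],
             k: int) -> List[DatabaseImage]:
    """Return the k database images most similar to the query."""
    ranked = sorted(database,
                    key=lambda im: cosine(query_descriptor, im.descriptor),
                    reverse=True)
    return ranked[:k]


def build_model_inputs(query_descriptor: Sequence[float],
                       database: List[DatabaseImage],
                       k: int) -> List[Tuple[str, List[Tuple[Point2D, Point3D]]]]:
    """Pair each retrieved image with its sparse 2D-3D annotations.

    The list may have any length up to k, mirroring the statement that
    the model accepts a variable number of images and annotations.
    """
    return [(im.name, list(zip(im.points_2d, im.points_3d)))
            for im in retrieve(query_descriptor, database, k)]
```

Because `k` and the number of annotations per image are left free, the same packaging applies to scenes of any size; only the retrieval database changes from one scene to the next.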
