$R^{2}$Former: Unified Retrieval and Reranking Transformer for Place Recognition

Visual Place Recognition (VPR) estimates the location of query images by matching them with images in a reference database. Conventional methods generally adopt aggregated CNN features for global retrieval and RANSAC-based geometric verification for reranking. However, RANSAC uses only geometric information and ignores other information that could be useful for reranking, e.g., local feature correlations and attention values. In this paper, we propose a unified place recognition framework that handles both retrieval and reranking with a novel transformer model, named $R^{2}$Former. The proposed reranking module takes feature correlation, attention value, and xy coordinates into account, and learns to determine whether an image pair is from the same location. The whole pipeline is end-to-end trainable, and the reranking module alone can also be adopted on other CNN or transformer backbones as a generic component. Remarkably, $R^{2}$Former significantly outperforms state-of-the-art methods on major VPR datasets with much less inference time and memory consumption. It also achieves state-of-the-art results on the hold-out MSLS challenge set and could serve as a simple yet strong solution for real-world large-scale applications. Experiments also show that vision transformer tokens are comparable to, and sometimes better than, CNN local features for local matching. The code is released at https://github.com/Jeff-Zilence/R2Former.
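The retrieve-then-rerank flow described above can be illustrated with a minimal NumPy sketch. It is not the released implementation: every function name here is hypothetical, and `placeholder_score` is a hand-written stand-in (an attention-weighted mean correlation) for the learned transformer reranker, which the abstract does not specify in detail. The sketch only shows how global retrieval narrows the candidates and how correlation, attention values, and xy coordinates could be gathered per image pair.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors to unit length along the given axis."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)


def retrieve_top_k(query_global, db_global, k):
    """Stage 1: rank database images by cosine similarity of global descriptors."""
    sims = l2_normalize(db_global) @ l2_normalize(query_global)
    return np.argsort(-sims)[:k]


def pair_features(q_local, q_attn, q_xy, r_local, r_attn, r_xy):
    """Collect per-token reranking inputs for one query/reference pair.

    For each query token, take its best-matching reference token and stack
    (correlation, query attention, reference attention, query xy, reference xy).
    This layout is an assumption made for illustration.
    """
    corr = l2_normalize(q_local) @ l2_normalize(r_local).T  # shape (Nq, Nr)
    best = corr.argmax(axis=1)
    best_corr = corr[np.arange(len(best)), best]
    return np.column_stack([best_corr, q_attn, r_attn[best], q_xy, r_xy[best]])


def placeholder_score(feats):
    """Stand-in for the learned reranker: attention-weighted mean correlation."""
    weights = feats[:, 1] * feats[:, 2]
    return float((weights * feats[:, 0]).sum() / (weights.sum() + 1e-12))


def rerank(query, database, candidates):
    """Stage 2: reorder the retrieved candidates by pairwise score, best first."""
    scores = [placeholder_score(pair_features(*query, *database[i]))
              for i in candidates]
    order = np.argsort(-np.asarray(scores))
    return [candidates[j] for j in order]
```

In this sketch each image is a tuple `(local_tokens, attention, xy)`, so a call looks like `rerank(query, database, retrieve_top_k(q_global, db_global, k))`. In the paper's framework the scoring step is learned end to end rather than fixed as it is here.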
