Simple, Effective and General: A New Backbone for Cross-view Image Geo-localization

In this work, we address an important but underexplored problem: designing a simple yet effective backbone specifically for the cross-view geo-localization task. Existing methods for cross-view geo-localization are frequently characterized by 1) complicated methodologies, 2) GPU-intensive computation, and 3) a stringent assumption that aerial and ground images are center- or orientation-aligned. To address these three challenges in cross-view image matching, we propose a new backbone network, named the Simple Attention-based Image Geo-localization network (SAIG). SAIG effectively represents long-range interactions among patches, as well as cross-view correspondence, with multi-head self-attention layers. The "narrow-deep" architecture of SAIG improves feature richness without degrading performance, while its shallow and effective convolutional stem preserves locality, eliminating the loss of patch-boundary information caused by patchification. SAIG achieves state-of-the-art results on cross-view geo-localization while being far simpler than previous works. Furthermore, with only 15.9% of the model parameters and half the output dimension of the state-of-the-art, SAIG adapts well across multiple cross-view datasets without employing any carefully designed feature aggregation modules or feature alignment algorithms. In addition, SAIG attains competitive scores on image retrieval benchmarks, further demonstrating its generalizability. As a backbone network, SAIG is both easy to follow and computationally lightweight, which is meaningful in practical scenarios. Moreover, we propose a simple Spatial-Mixed feature aggregation moDule (SMD) that can mix and project spatial information into a low-dimensional space to generate feature descriptors... (The code is available at https://github.com/yanghongji2007/SAIG)
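
As a rough illustration of what "mix and project spatial information into a low-dimensional space" could look like, the sketch below mixes backbone tokens across patch positions with a small MLP and then projects the spatial axis down to a few values per channel. This is not the authors' SMD implementation: the function name `spatial_mix_descriptor`, the two-layer mixing MLP, the GELU activation, the final L2 normalisation, and all sizes are assumptions chosen only to make the idea concrete.

```python
import math
import random


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    b_cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in b_cols] for row in a]


def gelu(x):
    """GELU activation computed with the Gaussian error function."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def random_matrix(rows, cols, rng, scale=0.1):
    """Random weights standing in for learned parameters (hypothetical)."""
    return [[rng.gauss(0.0, scale) for _ in range(cols)] for _ in range(rows)]


def spatial_mix_descriptor(tokens, w_in, w_out, w_proj):
    """Turn N x C patch tokens into one descriptor of length C * K.

    Assumed shapes: w_in is N x H, w_out is H x N, w_proj is N x K.
    """
    # Transpose to C x N so every linear map acts across patch positions.
    per_channel = [list(col) for col in zip(*tokens)]
    # Spatial mixing MLP (assumption): N -> H -> N with a GELU in between.
    hidden = [[gelu(v) for v in row] for row in matmul(per_channel, w_in)]
    mixed = matmul(hidden, w_out)
    # Project the N spatial positions down to K values per channel.
    projected = matmul(mixed, w_proj)
    # Flatten to a single vector and L2-normalise it for retrieval.
    flat = [v for row in projected for v in row]
    norm = math.sqrt(sum(v * v for v in flat)) or 1.0
    return [v / norm for v in flat]


rng = random.Random(0)
num_patches, channels, hidden_dim, out_k = 16, 8, 32, 4
tokens = random_matrix(num_patches, channels, rng, scale=1.0)
descriptor = spatial_mix_descriptor(
    tokens,
    random_matrix(num_patches, hidden_dim, rng),
    random_matrix(hidden_dim, num_patches, rng),
    random_matrix(num_patches, out_k, rng),
)
print(len(descriptor))  # channels * out_k = 32
```

In this sketch the descriptor length is `channels * out_k`, so a small `out_k` keeps the output dimension low regardless of how many patches the backbone produces. In a cross-view setting, ground and aerial descriptors built this way would typically be compared by cosine similarity.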
