论文信息 - Semantically-Aware Attentive Neural Embeddings for 2D Long-Term Visual Localization

Semantically-Aware Attentive Neural Embeddings for 2D Long-Term Visual Localization

We present an approach that combines appearance and semantic information for 2D image-based localization (2D-VL) across large perceptual changes and time lags. Compared to appearance features, the semantic layout of a scene is generally more invariant to appearance variations. We use this intuition and propose a novel end-to-end deep attention-based framework that utilizes multimodal cues to generate robust embeddings for 2D-VL. The proposed attention module predicts a shared channel attention and modality-specific spatial attentions to guide the embeddings to focus on more reliable image regions. We evaluate our model against state-of-the-art (SOTA) methods on three challenging localization datasets. We report an average (absolute) improvement of $19\%$ over current SOTA for 2D-VL. Furthermore, we present an extensive study demonstrating the contribution of each component of our model, showing $8$--$15\%$ and $4\%$ improvement from adding semantic information and our proposed attention module. We finally show the predicted attention maps to offer useful insights into our model.

[1] Luc Van Gool,et al. Night-to-Day Image Translation for Retrieval-based Localization , 2018, 2019 International Conference on Robotics and Automation (ICRA).

[2] In-So Kweon,et al. CBAM: Convolutional Block Attention Module , 2018, ECCV.

[3] Torsten Sattler,et al. Benchmarking 6DOF Outdoor Visual Localization in Changing Conditions , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[4] Javier González,et al. Training a Convolutional Neural Network for Appearance-Invariant Place Recognition , 2015, ArXiv.

[5] Torsten Sattler,et al. Hyperpoints and Fine Vocabularies for Large-Scale Location Recognition , 2015, 2015 IEEE International Conference on Computer Vision (ICCV).

[6] Gordon Wyeth,et al. SeqSLAM: Visual route-based navigation for sunny summer days and stormy winter nights , 2012, 2012 IEEE International Conference on Robotics and Automation.

[7] Michael Milford,et al. Convolutional Neural Network-based Place Recognition , 2014, ICRA 2014.

[8] Joon Son Chung,et al. Deep Audio-Visual Speech Recognition , 2018, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[9] Bolei Zhou,et al. Scene Parsing through ADE20K Dataset , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[10] Tomás Pajdla,et al. Avoiding Confusing Features in Place Recognition , 2010, ECCV.

[11] Sergey Levine,et al. Deep spatial autoencoders for visuomotor learning , 2015, 2016 IEEE International Conference on Robotics and Automation (ICRA).

[12] Lingqiao Liu,et al. Learning Context Flexible Attention Model for Long-Term Visual Place Recognition , 2018, IEEE Robotics and Automation Letters.

[13] Ross B. Girshick,et al. Fast R-CNN , 2015, 1504.08083.

[14] Girish Chowdhary,et al. GPS‐denied Indoor and Outdoor Monocular Vision Aided Navigation and Control of Unmanned Aircraft , 2013, J. Field Robotics.

[15] Jiasen Lu,et al. Hierarchical Question-Image Co-Attention for Visual Question Answering , 2016, NIPS.

[16] Wolfram Burgard,et al. VLocNet++: Deep Multitask Learning for Semantic Visual Localization and Odometry , 2018, IEEE Robotics and Automation Letters.

[17] Valérie Gouet-Brunet,et al. A survey on Visual-Based Localization: On the benefit of heterogeneous data , 2018, Pattern Recognit..

[18] Geoffrey E. Hinton,et al. ImageNet classification with deep convolutional neural networks , 2012, Commun. ACM.

[19] Masatoshi Okutomi,et al. 24/7 Place Recognition by View Synthesis , 2015, CVPR.

[20] Masatoshi Okutomi,et al. Visual Place Recognition with Repetitive Structures , 2015, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[21] Kate Saenko,et al. Ask, Attend and Answer: Exploring Question-Guided Spatial Attention for Visual Question Answering , 2015, ECCV.

[22] Josef Sivic,et al. NetVLAD: CNN Architecture for Weakly Supervised Place Recognition , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[23] Paul Newman,et al. 1 year, 1000 km: The Oxford RobotCar dataset , 2017, Int. J. Robotics Res..

[24] Roberto Cipolla,et al. PoseNet: A Convolutional Network for Real-Time 6-DOF Camera Relocalization , 2015, 2015 IEEE International Conference on Computer Vision (ICCV).

[25] Jiong Wang,et al. Attention-based Pyramid Aggregation Network for Visual Place Recognition , 2018, ACM Multimedia.

[26] Xiaogang Wang,et al. Pyramid Scene Parsing Network , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[27] Fredrik Kahl,et al. City-Scale Localization for Cameras with Known Vertical Direction , 2017, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[28] Ajay Divakaran,et al. Understanding Visual Ads by Aligning Symbols and Objects using Co-Attention , 2018, ArXiv.

[29] Richard Szeliski,et al. Modeling the World from Internet Photo Collections , 2008, International Journal of Computer Vision.

[30] Torsten Sattler,et al. Semantic Visual Localization , 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[31] Paul Newman,et al. Appearance-only SLAM at large scale with FAB-MAP 2.0 , 2011, Int. J. Robotics Res..

[32] Torsten Sattler,et al. Are Large-Scale 3D Models Really Necessary for Accurate Visual Localization? , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[33] Michael F. Cohen,et al. Real-time image-based 6-DOF localization in large-scale environments , 2012, 2012 IEEE Conference on Computer Vision and Pattern Recognition.

[34] Jimmy Ba,et al. Adam: A Method for Stochastic Optimization , 2014, ICLR.

[35] Jan-Michael Frahm,et al. From structure-from-motion point clouds to fast location recognition , 2009, 2009 IEEE Conference on Computer Vision and Pattern Recognition.

[36] Victor S. Lempitsky,et al. Aggregating Local Deep Features for Image Retrieval , 2015, 2015 IEEE International Conference on Computer Vision (ICCV).

[37] Inkyu Sa,et al. Only look once, mining distinctive landmarks from ConvNet for visual place recognition , 2017, IROS 2017.

[38] Alexander J. Smola,et al. Sampling Matters in Deep Embedding Learning , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[39] Niko Sünderhauf,et al. On the performance of ConvNet features for place recognition , 2015, 2015 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS).

[40] Carl Olsson,et al. Long-Term 3D Localization and Pose from Semantic Labellings , 2017, 2017 IEEE International Conference on Computer Vision Workshops (ICCVW).

[41] Andrew Zisserman,et al. DisLocation: Scalable Descriptor Distinctiveness for Location Recognition , 2014, ACCV.

[42] Gordon Wyeth,et al. FAB-MAP + RatSLAM: Appearance-based SLAM for multiple times of day , 2010, 2010 IEEE International Conference on Robotics and Automation.

[43] P. Cochat,et al. Et al , 2008, Archives de pediatrie : organe officiel de la Societe francaise de pediatrie.

[44] Wolfram Burgard,et al. Semantics-aware visual localization under challenging perceptual conditions , 2017, 2017 IEEE International Conference on Robotics and Automation (ICRA).

[45] James Philbin,et al. FaceNet: A unified embedding for face recognition and clustering , 2015, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[46] Lei Zhang,et al. Bottom-Up and Top-Down Attention for Image Captioning and Visual Question Answering , 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[47] Michael Milford,et al. LoST? Appearance-Invariant Place Recognition for Opposite Viewpoints using Visual Semantics , 2018, Robotics: Science and Systems.

[48] Lars Hammarstrand,et al. Long-Term Visual Localization Using Semantically Segmented Images , 2018, 2018 IEEE International Conference on Robotics and Automation (ICRA).

[49] Jiasen Lu,et al. VQA: Visual Question Answering , 2015, ICCV.

[50] Ronan Sicre,et al. Particular object retrieval with integral max-pooling of CNN activations , 2015, ICLR.

[51] Cordelia Schmid,et al. Beyond Bags of Features: Spatial Pyramid Matching for Recognizing Natural Scene Categories , 2006, 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'06).

[52] Robert Pless,et al. Consistent Temporal Variations in Many Outdoor Scenes , 2007, 2007 IEEE Conference on Computer Vision and Pattern Recognition.

[53] Carlos D. Castillo,et al. L2-constrained Softmax Loss for Discriminative Face Verification , 2017, ArXiv.

[54] Andrew Zisserman,et al. Visual Vocabulary with a Semantic Twist , 2014, ACCV.

[55] Pascal Fua,et al. Worldwide Pose Estimation Using 3D Point Clouds , 2012, ECCV.

[56] James J. Little,et al. Mobile Robot Localization and Mapping with Uncertainty using Scale-Invariant Visual Landmarks , 2002, Int. J. Robotics Res..

[57] Trevor Darrell,et al. Multimodal Compact Bilinear Pooling for Visual Question Answering and Visual Grounding , 2016, EMNLP.

[58] Jian Sun,et al. Deep Residual Learning for Image Recognition , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[59] Gordon Wyeth,et al. Transforming morning to afternoon using linear regression techniques , 2014, 2014 IEEE International Conference on Robotics and Automation (ICRA).

[60] Niko Sünderhauf,et al. Superpixel-based appearance change prediction for long-term navigation across seasons , 2014, Robotics Auton. Syst..

[61] Wolfram Burgard,et al. Robust Visual Robot Localization Across Seasons Using Network Flows , 2014, AAAI.

[62] Alexander J. Smola,et al. Stacked Attention Networks for Image Question Answering , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[63] Michael Milford,et al. Deep learning features at scale for visual place recognition , 2017, 2017 IEEE International Conference on Robotics and Automation (ICRA).

[64] Larry S. Davis,et al. Exploiting local features from deep networks for image retrieval , 2015, 2015 IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW).

[65] Paul Newman,et al. FAB-MAP: Probabilistic Localization and Mapping in the Space of Appearance , 2008, Int. J. Robotics Res..

[66] Sebastian Ramos,et al. The Cityscapes Dataset for Semantic Urban Scene Understanding , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[67] Anton van den Hengel,et al. Tips and Tricks for Visual Question Answering: Learnings from the 2017 Challenge , 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[68] Torsten Sattler,et al. Camera Pose Voting for Large-Scale Image-Based Localization , 2015, 2015 IEEE International Conference on Computer Vision (ICCV).