Multi-Modal Visual Place Recognition in Dynamics-Invariant Perception Space

Visual place recognition is an essential and challenging problem in robotics. In this letter, we explore, for the first time, the multi-modal fusion of semantic and visual modalities in a dynamics-invariant space to improve place recognition in dynamic environments. We first design a novel deep learning architecture that generates the static semantic segmentation and recovers the static image directly from the corresponding dynamic image. We then use the spatial-pyramid-matching model to encode the static semantic segmentation into feature vectors. In parallel, the static image is encoded with the popular bag-of-words model. Based on these multi-modal features, we finally measure the similarity between the query image and the target landmark by the joint similarity of their semantic and visual codes. Extensive experiments demonstrate the effectiveness and robustness of the proposed approach for place recognition in dynamic environments.
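The encoding and matching steps described in the abstract can be illustrated with a minimal sketch. It is not the letter's implementation: the function names, the pyramid depth, the use of histogram intersection for the semantic code, cosine similarity for the bag-of-words code, and the weighted sum controlled by `alpha` are all assumptions made for illustration, since the abstract does not specify how the joint similarity is formed.

```python
import numpy as np


def spatial_pyramid_histogram(labels, num_classes, levels=2):
    """Encode a 2-D semantic label map as weighted per-cell class histograms.

    Hypothetical helper: level l splits the map into 2^l x 2^l cells, and
    finer levels get larger weights, as in spatial pyramid matching.
    """
    h, w = labels.shape
    feats = []
    for level in range(levels + 1):
        cells = 2 ** level
        # Spatial-pyramid weights: level 0 gets 1/2^L, level l gets 1/2^(L-l+1).
        if level == 0:
            weight = 1.0 / 2 ** levels
        else:
            weight = 1.0 / 2 ** (levels - level + 1)
        rows = np.linspace(0, h, cells + 1).astype(int)
        cols = np.linspace(0, w, cells + 1).astype(int)
        for i in range(cells):
            for j in range(cells):
                cell = labels[rows[i]:rows[i + 1], cols[j]:cols[j + 1]]
                hist = np.bincount(cell.ravel(), minlength=num_classes)
                feats.append(weight * hist[:num_classes])
    vec = np.concatenate(feats).astype(float)
    total = vec.sum()
    # L1-normalise so that identical maps have intersection 1.
    return vec / total if total > 0 else vec


def histogram_intersection(a, b):
    """Similarity of two L1-normalised semantic codes."""
    return float(np.minimum(a, b).sum())


def cosine_similarity(a, b):
    """Similarity of two bag-of-words histograms (assumed metric)."""
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom > 0 else 0.0


def joint_similarity(sem_q, sem_t, bow_q, bow_t, alpha=0.5):
    """Assumed fusion rule: weighted sum of semantic and visual similarity."""
    sem = histogram_intersection(sem_q, sem_t)
    vis = cosine_similarity(bow_q, bow_t)
    return alpha * sem + (1.0 - alpha) * vis
```

In this sketch, a query would be matched by computing `joint_similarity` against every landmark and keeping the highest score. The label maps would come from the predicted static segmentation and the bag-of-words histograms from features of the recovered static image.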
