A Coarse to Fine Indoor Visual Localization Method Using Environmental Semantic Information

In this paper, we focus on the camera localization problem using visual semantic information. In contrast to state-of-the-art works, which often rely on visual features for localization, we propose a coarse-to-fine mechanism to localize the camera. First, a semantic database containing object information about the target environment is constructed using a deep learning method. Second, in the coarse step of visual localization, we match the class attributes of objects in the current frame against the object database to find candidate frames that contain similar objects. Third, the candidate frame most similar to the current frame is selected using CNN features. In the fine step of localization, the final camera pose is estimated by feature matching with semantic information. Compared with state-of-the-art visual localization methods, the proposed method based on semantic information achieves higher localization accuracy. Furthermore, the proposed framework is useful not only for visual localization but also for other advanced robot tasks, e.g., loop closure detection, object searching, and task reasoning.
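
The coarse step described above (filter database frames by object classes, then pick the closest candidate by CNN features) can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration and is not taken from the paper: the frame dictionaries, the `class_overlap` and `coarse_localize` helpers, the Jaccard-style overlap score, the `min_overlap` threshold, and the use of cosine similarity on short toy vectors standing in for CNN features. The fine step (pose estimation by feature matching) is not covered.

```python
import math
from collections import Counter


def class_overlap(query_classes, frame_classes):
    """Multiset Jaccard overlap between two lists of object class labels."""
    q, f = Counter(query_classes), Counter(frame_classes)
    inter = sum((q & f).values())
    union = sum((q | f).values())
    return inter / union if union else 0.0


def cosine(a, b):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def coarse_localize(query_classes, query_feat, database, min_overlap=0.5):
    """Return the id of the most similar database frame, or None."""
    # Stage 1: keep frames whose detected object classes resemble the query's.
    candidates = [
        frame for frame in database
        if class_overlap(query_classes, frame["classes"]) >= min_overlap
    ]
    if not candidates:
        return None
    # Stage 2: among the candidates, pick the closest global feature vector.
    best = max(candidates, key=lambda frame: cosine(query_feat, frame["feat"]))
    return best["id"]


# Hypothetical toy database: object labels and a stand-in feature per frame.
database = [
    {"id": "kitchen_01", "classes": ["chair", "table", "cup"],
     "feat": [0.9, 0.1, 0.0]},
    {"id": "office_07", "classes": ["chair", "monitor", "keyboard"],
     "feat": [0.1, 0.8, 0.3]},
    {"id": "office_12", "classes": ["chair", "monitor", "keyboard", "cup"],
     "feat": [0.2, 0.7, 0.6]},
]

match = coarse_localize(["monitor", "keyboard", "chair"], [0.1, 0.9, 0.2], database)
print(match)
```

In this toy setup the class filter discards the kitchen frame before any feature comparison takes place, which is the intended benefit of a coarse semantic stage: the more expensive feature ranking only runs on frames that already share objects with the query.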
