Semantic Localisation via Globally Unique Instance Segmentation

© 2018. The copyright of this document resides with its authors.

In this work we propose a novel approach to semantic localisation. Our work is motivated by the need for environment perception techniques which not only perform self-localisation within a map but also simultaneously recognise surrounding objects. Such capabilities are crucial for computer vision applications which interact with the environment, such as autonomous driving, augmented reality and robotics. To achieve this goal we propose a solution which consists of three key steps. Firstly, a database of panoramic RGB images and corresponding globally unique, per-pixel object instance labels is built for the desired environment, where we typically consider objects from static categories such as "building" or "tree". Secondly, a semantic segmentation network capable of predicting more than 3000 labels is trained on the collected data. Finally, for a given panoramic query image, the corresponding instance label image predicted by the network is used for semantic matching within the database. The matching is performed in two stages: (i) a fast retrieval of a small subset of database images (~100) with highly overlapping instance label histograms, followed by (ii) an explicit approximate 3 DoF (yaw, pitch, roll) alignment of the selected subset of images and the query image. We evaluate our approach in challenging indoor and outdoor navigation scenarios, achieving better or similar performance when compared to state-of-the-art image retrieval-based localisation approaches using keypoint matching [29, 63] and image-level embedding [3]. Our contributions include: (i) a description of a novel semantic localisation approach using globally unique instance segmentation, (ii) corresponding quantitative and qualitative analysis and (iii) a novel CamVid-360 dataset containing 986 labelled instances of buildings, trees, road signs and poles.
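The two-stage matching described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the use of histogram intersection as the overlap measure, and the pixel-agreement score are all assumptions made for illustration. Stage (ii) is also simplified to yaw only, modelled as a circular column shift of the panoramic label image, whereas the paper aligns yaw, pitch and roll. Label images are assumed to be equally sized lists of rows of integer instance ids.

```python
from collections import Counter


def label_histogram(label_image):
    """Count how many pixels carry each instance label."""
    return Counter(label for row in label_image for label in row)


def histogram_overlap(query_hist, db_hist):
    """Histogram intersection, normalised by the query pixel count (assumed measure)."""
    total = sum(query_hist.values())
    if total == 0:
        return 0.0
    shared = sum(min(n, db_hist.get(label, 0)) for label, n in query_hist.items())
    return shared / total


def retrieve_candidates(query, database, k=100):
    """Stage (i): indices of the k database images with the highest overlap."""
    q_hist = label_histogram(query)
    scores = [
        (histogram_overlap(q_hist, label_histogram(image)), index)
        for index, image in enumerate(database)
    ]
    scores.sort(key=lambda s: (-s[0], s[1]))
    return [index for _, index in scores[:k]]


def best_yaw_agreement(query, candidate):
    """Stage (ii), yaw only: try every circular column shift of the candidate
    and return (fraction of agreeing pixels, shift) for the best one."""
    width = len(query[0])
    n_pixels = len(query) * width
    best = (0.0, 0)
    for shift in range(width):
        agree = sum(
            q_row[x] == c_row[(x + shift) % width]
            for q_row, c_row in zip(query, candidate)
            for x in range(width)
        )
        score = agree / n_pixels
        if score > best[0]:
            best = (score, shift)
    return best


def localise(query, database, k=100):
    """Run both stages; return (database index, yaw shift in columns, score).
    Assumes a non-empty database."""
    candidates = retrieve_candidates(query, database, k)
    scored = [(best_yaw_agreement(query, database[i]), i) for i in candidates]
    (score, shift), index = max(scored, key=lambda s: s[0][0])
    return index, shift, score


# Toy example: the query is database image 1 rotated by three columns.
database = [
    [[3, 3, 4, 4], [3, 3, 4, 4]],
    [[1, 1, 2, 2], [1, 1, 2, 2]],
]
query = [[2, 1, 1, 2], [2, 1, 1, 2]]
result = localise(query, database, k=1)
```

The cheap histogram comparison prunes the database before the more expensive per-pixel alignment is attempted, which is the motivation the abstract gives for splitting the matching into two stages.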

[1]  Vincent Lepetit,et al.  LIFT: Learned Invariant Feature Transform , 2016, ECCV.

[2]  Martial Hebert,et al.  Data-Driven Scene Understanding from 3D Models , 2012, BMVC.

[3]  Ross B. Girshick,et al.  Mask R-CNN , 2017, arXiv:1703.06870.

[4]  Jan-Michael Frahm,et al.  Learned Contextual Feature Reweighting for Image Geo-Localization , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[5]  Vladlen Koltun,et al.  Playing for Benchmarks , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[6]  Timo Pylvänäinen,et al.  Automatic Alignment and Multi-View Segmentation of Street View Data using 3D Shape Priors , 2010 .

[7]  Michal Havlena,et al.  From Google Street View to 3D city models , 2009, 2009 IEEE 12th International Conference on Computer Vision Workshops, ICCV Workshops.

[8]  Roberto Cipolla,et al.  PoseNet: A Convolutional Network for Real-Time 6-DOF Camera Relocalization , 2015, 2015 IEEE International Conference on Computer Vision (ICCV).

[9]  Sanja Fidler,et al.  Lost Shopping! Monocular Localization in Large Indoor Spaces , 2015, 2015 IEEE International Conference on Computer Vision (ICCV).

[10]  Xin Zhang,et al.  End to End Learning for Self-Driving Cars , 2016, ArXiv.

[11]  Antonio Criminisi,et al.  TextonBoost for Image Understanding: Multi-Class Object Recognition and Segmentation by Jointly Modeling Texture, Layout, and Context , 2007, International Journal of Computer Vision.

[12]  Ashish Kapoor,et al.  AirSim: High-Fidelity Visual and Physical Simulation for Autonomous Vehicles , 2017, FSR.

[13]  Nicolas Pugeault,et al.  SeDAR - Semantic Detection and Ranging: Humans can Localise without LiDAR, can Robots? , 2017, 2018 IEEE International Conference on Robotics and Automation (ICRA).

[14]  Carsten Rother,et al.  Panoptic Segmentation , 2018, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[15]  Thomas A. Funkhouser,et al.  MINOS: Multimodal Indoor Simulator for Navigation in Complex Environments , 2017, ArXiv.

[16]  Bernard Ghanem,et al.  UE4Sim: A Photo-Realistic Simulator for Computer Vision Applications , 2017, ArXiv.

[17]  Ryan M. Eustice,et al.  Visual localization within LIDAR maps for automated urban driving , 2014, 2014 IEEE/RSJ International Conference on Intelligent Robots and Systems.

[18]  Bohyung Han,et al.  Large-Scale Image Retrieval with Attentive Deep Local Features , 2016, 2017 IEEE International Conference on Computer Vision (ICCV).

[19]  Alex A. Freitas,et al.  A survey of hierarchical classification across different application domains , 2010, Data Mining and Knowledge Discovery.

[20]  Sven Behnke,et al.  Fast Object Learning and Dual-arm Coordination for Cluttered Stowing, Picking, and Packing , 2018, 2018 IEEE International Conference on Robotics and Automation (ICRA).

[21]  Germán Ros,et al.  CARLA: An Open Urban Driving Simulator , 2017, CoRL.

[22]  Qiao Wang,et al.  Virtual Worlds as Proxy for Multi-object Tracking Analysis , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[23]  Sanja Fidler,et al.  HouseCraft: Building Houses from Rental Ads and Street Views , 2016, ECCV.

[24]  Davide Scaramuzza,et al.  Benefit of large field-of-view cameras for visual odometry , 2016, 2016 IEEE International Conference on Robotics and Automation (ICRA).

[25]  Christian Szegedy,et al.  Large Scale Business Discovery from Street Level Imagery , 2015, ArXiv.

[26]  Michael S. Bernstein,et al.  ImageNet Large Scale Visual Recognition Challenge , 2014, International Journal of Computer Vision.

[27]  Anton van den Hengel,et al.  Wider or Deeper: Revisiting the ResNet Model for Visual Recognition , 2016, Pattern Recognit..

[28]  Tomás Pajdla,et al.  Learning and Calibrating Per-Location Classifiers for Visual Place Recognition , 2013, CVPR.

[29]  Marc Pollefeys,et al.  Slanted Stixels: Representing San Francisco's Steepest Streets , 2017, BMVC.

[30]  Ilya Kostrikov,et al.  PlaNet - Photo Geolocation with Convolutional Neural Networks , 2016, ECCV.

[31]  Peter Kontschieder,et al.  The Mapillary Vistas Dataset for Semantic Understanding of Street Scenes , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[32]  Sven Behnke,et al.  RGB-D object detection and semantic segmentation for autonomous manipulation in clutter , 2018, Int. J. Robotics Res..

[33]  Antonio M. López,et al.  The SYNTHIA Dataset: A Large Collection of Synthetic Images for Semantic Segmentation of Urban Scenes , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[34]  David G. Lowe.  Distinctive Image Features from Scale-Invariant Keypoints , 2004 .

[35]  Luc Van Gool,et al.  Semantic Instance Segmentation with a Discriminative Loss Function , 2017, ArXiv.

[36]  Xin Chen,et al.  City-scale landmark identification on mobile devices , 2011, CVPR 2011.

[37]  Paul Newman,et al.  1 year, 1000 km: The Oxford RobotCar dataset , 2017, Int. J. Robotics Res..

[38]  Alex Bewley,et al.  Addressing appearance change in outdoor robotics with adversarial domain adaptation , 2017, 2017 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS).

[39]  Roberto Cipolla,et al.  Semantic object classes in video: A high-definition ground truth database , 2009, Pattern Recognit. Lett..

[40]  Vincent Lepetit,et al.  Learning to Align Semantic Segmentation and 2.5D Maps for Geolocalization , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[41]  Jan-Michael Frahm,et al.  Indoor-Outdoor 3D Reconstruction Alignment , 2016, ECCV.

[42]  Silvio Savarese,et al.  Joint 2D-3D-Semantic Data for Indoor Scene Understanding , 2017, ArXiv.

[43]  Sebastian Ramos,et al.  The Cityscapes Dataset for Semantic Urban Scene Understanding , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[44]  Masatoshi Okutomi,et al.  Visual Place Recognition with Repetitive Structures , 2015, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[45]  Andreas Geiger,et al.  Are we ready for autonomous driving? The KITTI vision benchmark suite , 2012, 2012 IEEE Conference on Computer Vision and Pattern Recognition.

[46]  Tao Zhang,et al.  Robust RGB-D simultaneous localization and mapping using planar point features , 2015, Robotics Auton. Syst..

[47]  Yi Zhang,et al.  UnrealCV: Virtual Worlds for Computer Vision , 2017, ACM Multimedia.

[48]  Torsten Sattler,et al.  Semantic Visual Localization , 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[49]  Ignas Budvytis,et al.  Large Scale Labelled Video Data Augmentation for Semantic Segmentation in Driving Scenarios , 2017, 2017 IEEE International Conference on Computer Vision Workshops (ICCVW).

[50]  Tomás Pajdla,et al.  NetVLAD: CNN Architecture for Weakly Supervised Place Recognition , 2018, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[51]  Andrew W. Fitzgibbon,et al.  KinectFusion: real-time 3D reconstruction and interaction using a moving depth camera , 2011, UIST.

[52]  Jan Kautz,et al.  High-Resolution Image Synthesis and Semantic Manipulation with Conditional GANs , 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[53]  Stefan Leutenegger,et al.  SceneNet RGB-D: Can 5M Synthetic Images Beat Generic ImageNet Pre-training on Indoor Segmentation? , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[54]  Wolfram Burgard,et al.  Semantics-aware visual localization under challenging perceptual conditions , 2017, 2017 IEEE International Conference on Robotics and Automation (ICRA).

[55]  James Philbin,et al.  FaceNet: A unified embedding for face recognition and clustering , 2015, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[56]  Masatoshi Okutomi,et al.  24/7 Place Recognition by View Synthesis , 2018, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[57]  Vincent Lepetit,et al.  Efficient 3D Tracking in Urban Environments with Semantic Segmentation , 2017, BMVC.

[58]  Matthias Nießner,et al.  Matterport3D: Learning from RGB-D Data in Indoor Environments , 2017, 2017 International Conference on 3D Vision (3DV).

[59]  Richard Szeliski,et al.  High-accuracy stereo depth maps using structured light , 2003, 2003 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2003. Proceedings..

[60]  Matthijs C. Dorst.  Distinctive Image Features from Scale-Invariant Keypoints , 2011 .

[61]  Vladlen Koltun,et al.  Playing for Data: Ground Truth from Computer Games , 2016, ECCV.

[62]  Vincent Lepetit,et al.  Accurate Camera Registration in Urban Environments Using High-Level Feature Matching , 2017, BMVC.