Efficient Object-Oriented Semantic Mapping With Object Detector

Incrementally building a 3D map in which object instances are semantically annotated has a wide range of applications, including scene understanding, human–robot interaction, and extensions to simultaneous localization and mapping. Although researchers are developing efficient and accurate systems, these methods still face a critical issue, real-time processing, because the task requires a series of computationally heavy components, e.g., camera pose estimation, 3D map reconstruction, and especially recognition. In this paper, we propose a novel object-oriented semantic mapping approach that addresses this issue by providing highly accurate object-oriented semantic scene reconstruction in real time. For high efficiency, the proposed method employs a fast and scalable object detection algorithm to extract semantic information from the incoming frames. The detector outputs are integrated into geometric regions of the 3D map, which are produced by a geometry-based incremental segmentation method. Assigning class probabilities to each segmented region, rather than to each map element (e.g., each surfel or voxel), notably reduces both the computational cost and the memory footprint. Beyond efficiency, segmenting the 3D map geometrically first yields clear boundaries between objects. In turn, the semantic information refines the geometry-based segmentation, extending it from a geometry-only representation to a semantic-aware one. We validate the proposed method’s accuracy and computational efficiency through experiments in a common office scene.
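The region-level labeling strategy described above can be illustrated with a minimal sketch. The abstract does not give the update rule, so the multiplicative Bayesian fusion used here is an assumption, and the names `SegmentLabelMap`, `update`, and `label` are hypothetical. The sketch only shows that one probability vector is stored per segment, no matter how many surfels or voxels the segment contains.

```python
class SegmentLabelMap:
    """Stores one class-probability vector per geometric segment.

    Hypothetical helper for illustration; not the paper's implementation.
    """

    def __init__(self, num_classes):
        self.num_classes = num_classes
        self.probs = {}  # segment id -> list of class probabilities

    def update(self, segment_id, detection_probs):
        # Assumed fusion rule: multiply the stored distribution by the
        # detector's class scores and renormalize (recursive Bayesian update).
        uniform = [1.0 / self.num_classes] * self.num_classes
        prior = self.probs.get(segment_id, uniform)
        unnormalized = [p * d for p, d in zip(prior, detection_probs)]
        total = sum(unnormalized)
        # Keep the prior if the detection carries no probability mass.
        posterior = [u / total for u in unnormalized] if total > 0 else prior
        self.probs[segment_id] = posterior
        return posterior

    def label(self, segment_id):
        # Most probable class index for the segment.
        probs = self.probs[segment_id]
        return max(range(self.num_classes), key=probs.__getitem__)


# Two detections from successive frames fall on the same segment (id 7).
labels = SegmentLabelMap(num_classes=3)  # e.g. 0=chair, 1=monitor, 2=keyboard
labels.update(segment_id=7, detection_probs=[0.2, 0.7, 0.1])
labels.update(segment_id=7, detection_probs=[0.3, 0.6, 0.1])
best_class = labels.label(7)
```

Under this scheme the memory cost grows with the number of segments times the number of classes, rather than with the number of map elements times the number of classes, which is the saving the abstract refers to.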
