Volumetric Instance-Aware Semantic Mapping and 3D Object Discovery

To autonomously navigate and plan interactions in real-world environments, robots require the ability to robustly perceive and map complex, unstructured surrounding scenes. Besides building an internal representation of the observed scene geometry, the key insight toward a truly functional understanding of the environment is the usage of higher level entities during mapping, such as individual object instances. This work presents an approach to incrementally build volumetric object-centric maps during online scanning with a localized RGB-D camera. First, a per-frame segmentation scheme combines an unsupervised geometric approach with instance-aware semantic predictions to detect both recognized scene elements as well as previously unseen objects. Next, a data association step tracks the predicted instances across the different frames. Finally, a map integration strategy fuses information about their 3D shape, location, and, if available, semantic class into a global volume. Evaluation on a publicly available dataset shows that the proposed approach for building instance-level semantic maps is competitive with state-of-the-art methods, while additionally able to discover objects of unseen categories. The system is further evaluated within a real-world robotic mapping setup, for which qualitative results highlight the online nature of the method. Code is available at https://github.com/ethz-asl/voxblox-plusplus.

[1]  Paul H. J. Kelly,et al.  SLAM++: Simultaneous Localisation and Mapping at the Level of Objects , 2013, 2013 IEEE Conference on Computer Vision and Pattern Recognition.

[2]  Nassir Navab,et al.  When 2.5D is not enough: Simultaneous reconstruction, segmentation and recognition on dense SLAM , 2016, 2016 IEEE International Conference on Robotics and Automation (ICRA).

[3]  Lourdes Agapito,et al.  MaskFusion: Real-Time Recognition, Tracking and Reconstruction of Multiple Moving Objects , 2018, 2018 IEEE International Symposium on Mixed and Augmented Reality (ISMAR).

[4]  Kaiming He,et al.  Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks , 2015, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[5]  Hideo Saito,et al.  Efficient Object-Oriented Semantic Mapping With Object Detector , 2019, IEEE Access.

[6]  Stefan Leutenegger,et al.  SemanticFusion: Dense 3D semantic mapping with convolutional neural networks , 2016, 2017 IEEE International Conference on Robotics and Automation (ICRA).

[7]  Abel Gawel,et al.  Incremental Object Database: Building 3D Models from Multiple Partial Observations , 2018, 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS).

[8]  Gustavo Carneiro,et al.  Bayesian Semantic Instance Segmentation in Open Set World , 2018, ECCV.

[9]  Trevor Darrell,et al.  Learning to Segment Every Thing , 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[10]  Wolfram Burgard,et al.  Hierarchies of octrees for efficient 3D mapping , 2011, 2011 IEEE/RSJ International Conference on Intelligent Robots and Systems.

[11]  Roland Siegwart,et al.  Voxblox: Incremental 3D Euclidean Signed Distance Fields for on-board MAV planning , 2016, 2017 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS).

[12]  Ian D. Reid,et al.  SceneCut: Joint Geometric and Object Segmentation for Indoor Scenes , 2017, 2018 IEEE International Conference on Robotics and Automation (ICRA).

[13]  Stefan Leutenegger,et al.  Fusion++: Volumetric Object-Level SLAM , 2018, 2018 International Conference on 3D Vision (3DV).

[14]  Federico Tombari,et al.  Real-time and scalable incremental segmentation on dense SLAM , 2015, 2015 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS).

[15]  Duc Thanh Nguyen,et al.  SceneNN: A Scene Meshes Dataset with aNNotations , 2016, 2016 Fourth International Conference on 3D Vision (3DV).

[16]  Abel Gawel,et al.  Incremental Object Database: Building 3D Models from Multiple Partial Observations , 2018, IROS 2018.

[17]  Ali Farhadi,et al.  YOLO9000: Better, Faster, Stronger , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[18]  Hideo Saito,et al.  Fast and Accurate Semantic Mapping through Geometric-based Incremental Segmentation , 2018, 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS).

[19]  Michael Milford,et al.  Meaningful maps with object-oriented semantic mapping , 2016, 2017 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS).

[20]  Tim Weyrich,et al.  Real-Time 3D Reconstruction in Dynamic Scenes Using Point-Based Fusion , 2013, 2013 International Conference on 3D Vision.

[21]  Charles C. Kemp,et al.  Challenges for robot manipulation in human environments [Grand Challenges of Robotics] , 2007, IEEE Robotics & Automation Magazine.

[22]  Olaf Kähler,et al.  InfiniTAM v3: A Framework for Large-Scale 3D Reconstruction with Loop Closure , 2017, ArXiv.

[23]  Ross B. Girshick,et al.  Mask R-CNN , 2017, 1703.06870.

[24]  Roland Siegwart,et al.  A synchronized visual-inertial sensor system with FPGA pre-processing for accurate real-time SLAM , 2014, ICRA 2014.

[25]  Wei Liu,et al.  SSD: Single Shot MultiBox Detector , 2015, ECCV.

[26]  Stefan Leutenegger,et al.  Efficient Octree-Based Volumetric SLAM Supporting Signed-Distance and Occupancy Mapping , 2018, IEEE Robotics and Automation Letters.

[27]  Duc Thanh Nguyen,et al.  Real-Time Progressive 3D Semantic Segmentation for Indoor Scenes , 2018, 2019 IEEE Winter Conference on Applications of Computer Vision (WACV).

[28]  Pietro Perona,et al.  Microsoft COCO: Common Objects in Context , 2014, ECCV.

[29]  Roland Siegwart,et al.  Maplab: An Open Framework for Research in Visual-Inertial Mapping and Localization , 2017, IEEE Robotics and Automation Letters.