Fusion++: Volumetric Object-Level SLAM

We propose an online object-level SLAM system which builds a persistent and accurate 3D graph map of arbitrary reconstructed objects. As an RGB-D camera browses a cluttered indoor scene, Mask-RCNN instance segmentations are used to initialise compact per-object Truncated Signed Distance Function (TSDF) reconstructions with object size-dependent resolutions and a novel 3D foreground mask. Reconstructed objects are stored in an optimisable 6DoF pose graph which is our only persistent map representation. Objects are incrementally refined via depth fusion, and are used for tracking, relocalisation and loop closure detection. Loop closures cause adjustments in the relative pose estimates of object instances, but no intra-object warping. Each object also carries semantic information which is refined over time and an existence probability to account for spurious instance predictions. We demonstrate our approach on a hand-held RGB-D sequence from a cluttered office scene with a large number and variety of object instances, highlighting how the system closes loops and makes good use of existing objects on repeated loops. We quantitatively evaluate the trajectory error of our system against a baseline approach on the RGB-D SLAM benchmark, and qualitatively compare reconstruction quality of discovered objects on the YCB video dataset. Performance evaluation shows our approach is highly memory efficient and runs online at 4-8Hz (excluding relocalisation) despite not being optimised at the software level.

[1]  Dieter Fox,et al.  PoseCNN: A Convolutional Neural Network for 6D Object Pose Estimation in Cluttered Scenes , 2017, Robotics: Science and Systems.

[2]  Vladlen Koltun,et al.  Robust reconstruction of indoor scenes , 2015, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[3]  Paul H. J. Kelly,et al.  SLAM++: Simultaneous Localisation and Mapping at the Level of Objects , 2013, 2013 IEEE Conference on Computer Vision and Pattern Recognition.

[4]  Siddhartha S. Srinivasa,et al.  Exploiting domain knowledge for Object Discovery , 2013, 2013 IEEE International Conference on Robotics and Automation.

[5]  John J. Leonard,et al.  Kintinuous: Spatially Extended KinectFusion , 2012, AAAI 2012.

[6]  Andrew J. Davison,et al.  DTAM: Dense tracking and mapping in real-time , 2011, 2011 International Conference on Computer Vision.

[7]  Duc Thanh Nguyen,et al.  Real-Time Progressive 3D Semantic Segmentation for Indoor Scenes , 2018, 2019 IEEE Winter Conference on Applications of Computer Vision (WACV).

[8]  Jörg Stückler,et al.  Model Learning and Real-Time Tracking Using Multi-Resolution Surfel Maps , 2012, AAAI.

[9]  Tom Drummond,et al.  MO-SLAM: Multi object SLAM with run-time object discovery through duplicates , 2016, 2016 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS).

[10]  Paul H. J. Kelly,et al.  Dense planar SLAM , 2014, 2014 IEEE International Symposium on Mixed and Augmented Reality (ISMAR).

[11]  John J. Leonard,et al.  Real-time large-scale dense RGB-D SLAM with volumetric fusion , 2014, Int. J. Robotics Res..

[12]  Roland Siegwart,et al.  TSDF-based change detection for consistent long-term dense reconstruction and dynamic object discovery , 2017, 2017 IEEE International Conference on Robotics and Automation (ICRA).

[13]  Bastian Leibe,et al.  Dense 3D semantic mapping of indoor scenes from RGB-D images , 2014, 2014 IEEE International Conference on Robotics and Automation (ICRA).

[14]  Jörg Stückler,et al.  Hierarchical Object Discovery and Dense Modelling From Motion Cues in RGB-D Video , 2013, IJCAI.

[15]  Andrew W. Fitzgibbon,et al.  KinectFusion: Real-time dense surface mapping and tracking , 2011, 2011 10th IEEE International Symposium on Mixed and Augmented Reality.

[16]  Vladlen Koltun,et al.  Elastic Fragments for Dense Scene Reconstruction , 2013, 2013 IEEE International Conference on Computer Vision.

[17]  Lu Ma,et al.  Unsupervised Dense Object Discovery, Detection, Tracking and Reconstruction , 2014, ECCV.

[18]  Roland Siegwart,et al.  BRISK: Binary Robust invariant scalable keypoints , 2011, 2011 International Conference on Computer Vision.

[19]  Vladlen Koltun,et al.  Dense scene reconstruction with points of interest , 2013, ACM Trans. Graph..

[20]  Ross B. Girshick,et al.  Mask R-CNN , 2017, 1703.06870.

[21]  Marc Levoy,et al.  A volumetric method for building complex models from range images , 1996, SIGGRAPH.

[22]  Matthias Nießner,et al.  ScanComplete: Large-Scale Scene Completion and Semantic Segmentation for 3D Scans , 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[23]  Wolfram Burgard,et al.  A benchmark for the evaluation of RGB-D SLAM systems , 2012, 2012 IEEE/RSJ International Conference on Intelligent Robots and Systems.

[24]  Leonidas J. Guibas,et al.  ShapeNet: An Information-Rich 3D Model Repository , 2015, ArXiv.

[25]  Juan D. Tardós,et al.  ORB-SLAM2: An Open-Source SLAM System for Monocular, Stereo, and RGB-D Cameras , 2016, IEEE Transactions on Robotics.

[26]  Matthias Nießner,et al.  BundleFusion , 2016, TOGS.

[27]  Stefan Leutenegger,et al.  ElasticFusion: Dense SLAM Without A Pose Graph , 2015, Robotics: Science and Systems.

[28]  Laurent Kneip,et al.  OpenGV: A unified and generalized approach to real-time calibrated geometric vision , 2014, 2014 IEEE International Conference on Robotics and Automation (ICRA).

[29]  Henrik I. Christensen,et al.  Efficient Organized Point Cloud Segmentation with Connected Components , 2013 .

[30]  Juan D. Tardós,et al.  Probabilistic Semi-Dense Mapping from Highly Accurate Feature-Based Monocular SLAM , 2015, Robotics: Science and Systems.

[31]  Stefan Leutenegger,et al.  SemanticFusion: Dense 3D semantic mapping with convolutional neural networks , 2016, 2017 IEEE International Conference on Robotics and Automation (ICRA).

[32]  Jian Sun,et al.  Deep Residual Learning for Image Recognition , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[33]  Lourdes Agapito,et al.  MaskFusion: Real-Time Recognition, Tracking and Reconstruction of Multiple Moving Objects , 2018, 2018 IEEE International Symposium on Mixed and Augmented Reality (ISMAR).

[34]  Siddhartha S. Srinivasa,et al.  The YCB object and Model set: Towards common benchmarks for manipulation research , 2015, 2015 International Conference on Advanced Robotics (ICAR).

[35]  John J. Leonard,et al.  Monocular SLAM Supported Object Recognition , 2015, Robotics: Science and Systems.

[36]  Nassir Navab,et al.  When 2.5D is not enough: Simultaneous reconstruction, segmentation and recognition on dense SLAM , 2016, 2016 IEEE International Conference on Robotics and Automation (ICRA).

[37]  Kaiming He,et al.  Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks , 2015, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[38]  Michael Milford,et al.  Meaningful maps with object-oriented semantic mapping , 2016, 2017 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS).

[39]  Frank Dellaert,et al.  SLAM with object discovery, modeling and mapping , 2014, 2014 IEEE/RSJ International Conference on Intelligent Robots and Systems.

[40]  John J. Leonard,et al.  Toward lifelong object segmentation from change detection in dense RGB-D maps , 2013, 2013 European Conference on Mobile Robots.

[41]  Wolfram Burgard,et al.  G2o: A general framework for graph optimization , 2011, 2011 IEEE International Conference on Robotics and Automation.

[42]  Gary R. Bradski,et al.  ORB: An efficient alternative to SIFT or SURF , 2011, 2011 International Conference on Computer Vision.