SemanticFusion: Dense 3D semantic mapping with convolutional neural networks

Ever more robust, accurate and detailed mapping using visual sensing has proven to be an enabling factor for mobile robots across a wide variety of applications. For the next level of robot intelligence and intuitive user interaction, maps need to extend beyond geometry and appearance: they need to contain semantics. We address this challenge by combining Convolutional Neural Networks (CNNs) with a state-of-the-art dense Simultaneous Localization and Mapping (SLAM) system, ElasticFusion, which provides long-term dense correspondences between frames of indoor RGB-D video even during loopy scanning trajectories. These correspondences allow the CNN's semantic predictions from multiple viewpoints to be probabilistically fused into a map. This produces a useful semantic 3D map, and we also show on the NYUv2 dataset that fusing multiple predictions improves even the 2D semantic labelling over baseline single-frame predictions. We further show that on a smaller reconstruction dataset with larger variation in prediction viewpoint, the improvement over single-frame segmentation increases. Our system is efficient enough to allow real-time interactive use at frame rates of ≈25 Hz.
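The abstract does not spell out the fusion rule, so the following is only a minimal sketch of one common choice: a recursive Bayesian update in which each map element stores a class distribution that is multiplied by every new per-view prediction and renormalised. The function names, the uniform initial belief, and the three-class example values are all hypothetical and are not taken from the paper.

```python
from typing import List


def fuse_prediction(prior: List[float], prediction: List[float]) -> List[float]:
    """Multiply a stored class distribution by a new prediction and renormalise.

    Assumption: predictions from different views are treated as independent,
    which is what makes a simple product update valid.
    """
    if len(prior) != len(prediction):
        raise ValueError("prior and prediction must cover the same classes")
    product = [p * q for p, q in zip(prior, prediction)]
    total = sum(product)
    if total == 0.0:
        # The two distributions share no support; keep the prior unchanged
        # instead of dividing by zero.
        return list(prior)
    return [value / total for value in product]


def fuse_sequence(num_classes: int, predictions: List[List[float]]) -> List[float]:
    """Fuse per-view predictions for one map element, starting from a uniform belief."""
    belief = [1.0 / num_classes] * num_classes
    for prediction in predictions:
        belief = fuse_prediction(belief, prediction)
    return belief


# Hypothetical per-view class probabilities for a single map element,
# observed from three viewpoints (three classes).
views = [
    [0.5, 0.3, 0.2],
    [0.6, 0.3, 0.1],
    [0.4, 0.4, 0.2],
]
fused = fuse_sequence(3, views)
```

In this toy example the fused belief for the first class is 0.75, higher than in any single view (at most 0.6), which illustrates how agreement across viewpoints can sharpen a label beyond what one frame provides.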
