Semi-Dense 3D Semantic Mapping from Monocular SLAM

The bundle of geometry and appearance in computer vision has proven to be a promising solution for robots across a wide variety of applications. Stereo cameras and RGB-D sensors are widely used to realise fast 3D reconstruction and trajectory tracking in a dense way. However, they lack flexibility of seamless switch between different scaled environments, i.e., indoor and outdoor scenes. In addition, semantic information are still hard to acquire in a 3D mapping. We address this challenge by combining the state-of-art deep learning method and semi-dense Simultaneous Localisation and Mapping (SLAM) based on video stream from a monocular camera. In our approach, 2D semantic information are transferred to 3D mapping via correspondence between connective Keyframes with spatial consistency. There is no need to obtain a semantic segmentation for each frame in a sequence, so that it could achieve a reasonable computation time. We evaluate our method on indoor/outdoor datasets and lead to an improvement in the 2D semantic labelling over baseline single frame predictions.

[1]  Bastian Leibe,et al.  Dense 3D semantic mapping of indoor scenes from RGB-D images , 2014, 2014 IEEE International Conference on Robotics and Automation (ICRA).

[2]  Daniel Cremers,et al.  LSD-SLAM: Large-Scale Direct Monocular SLAM , 2014, ECCV.

[3]  Yann LeCun,et al.  Indoor Semantic Segmentation using depth information , 2013, ICLR.

[4]  Davide Scaramuzza,et al.  SVO: Fast semi-direct monocular visual odometry , 2014, 2014 IEEE International Conference on Robotics and Automation (ICRA).

[5]  Iasonas Kokkinos,et al.  DeepLab: Semantic Image Segmentation with Deep Convolutional Nets, Atrous Convolution, and Fully Connected CRFs , 2016, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[6]  Ali Shahrokni,et al.  Urban 3D semantic modelling using stereo vision , 2013, 2013 IEEE International Conference on Robotics and Automation.

[7]  Daniel Cremers,et al.  Dense visual SLAM for RGB-D cameras , 2013, 2013 IEEE/RSJ International Conference on Intelligent Robots and Systems.

[8]  Markus Vincze,et al.  Fast semantic segmentation of 3D point clouds using a dense CRF with learned parameters , 2015, 2015 IEEE International Conference on Robotics and Automation (ICRA).

[9]  Paul H. J. Kelly,et al.  SLAM++: Simultaneous Localisation and Mapping at the Level of Objects , 2013, 2013 IEEE Conference on Computer Vision and Pattern Recognition.

[10]  Stefan Leutenegger,et al.  ElasticFusion: Dense SLAM Without A Pose Graph , 2015, Robotics: Science and Systems.

[11]  Luc Van Gool,et al.  3D all the way: Semantic segmentation of urban scenes from start to end in 3D , 2015, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[12]  한보형,et al.  Learning Deconvolution Network for Semantic Segmentation , 2015 .

[13]  Stefan Leutenegger,et al.  SemanticFusion: Dense 3D semantic mapping with convolutional neural networks , 2016, 2017 IEEE International Conference on Robotics and Automation (ICRA).

[14]  Olivier Stasse,et al.  MonoSLAM: Real-Time Single Camera SLAM , 2007, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[15]  Vladlen Koltun,et al.  Efficient Inference in Fully Connected CRFs with Gaussian Edge Potentials , 2011, NIPS.

[16]  J. M. M. Montiel,et al.  ORB-SLAM: A Versatile and Accurate Monocular SLAM System , 2015, IEEE Transactions on Robotics.

[17]  Rob Fergus,et al.  Predicting Depth, Surface Normals and Semantic Labels with a Common Multi-scale Convolutional Architecture , 2014, 2015 IEEE International Conference on Computer Vision (ICCV).

[18]  Daniel Cremers,et al.  Semi-dense Visual Odometry for a Monocular Camera , 2013, 2013 IEEE International Conference on Computer Vision.