A Real-Time Online Learning Framework for Joint 3D Reconstruction and Semantic Segmentation of Indoor Scenes

This letter presents a real-time online vision framework that jointly recovers an indoor scene’s 3D structure and semantic labels. Given noisy depth maps, a camera trajectory, and 2D semantic labels at training time, the proposed deep-neural-network-based approach learns to fuse depth across frames with suitable semantic labels in the scene space. Our approach exploits a joint volumetric representation of depth and semantics in the scene feature space to solve this task. To fuse semantic labels and geometry online in real time, we introduce an efficient vortex pooling block and drop the routing network used in online depth fusion, which preserves high-frequency surface details. We show that the context information provided by the scene's semantics helps the depth fusion network learn noise-resistant features. It also helps overcome the shortcomings of the current online depth fusion method in handling thin object structures, thickening artifacts, and false surfaces. Experimental evaluation on the Replica dataset shows that our approach performs depth fusion at 37 and 10 frames per second with average reconstruction F-scores of 88% and 91%, respectively, depending on the depth map resolution. Moreover, our model achieves an average IoU of 0.515 on the ScanNet 3D semantic benchmark leaderboard. Code and example dataset information are available at https://github.com/suryanshkumar/online-joint-depthfusion-and-semantic.
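The two reported metrics, reconstruction F-score and IoU, can be illustrated with a minimal sketch. It uses the standard textbook definitions on flat label and occupancy arrays; the exact evaluation protocol behind the numbers above (thresholds, voxel resolution, class set) is not stated here, so the function names `mean_iou` and `f_score` and the toy inputs are assumptions for illustration only.

```python
import numpy as np


def mean_iou(pred, gt, num_classes):
    """Mean intersection-over-union over classes present in pred or gt."""
    ious = []
    for c in range(num_classes):
        pred_c = pred == c
        gt_c = gt == c
        union = np.logical_or(pred_c, gt_c).sum()
        if union == 0:
            continue  # class absent from both arrays; skip it
        inter = np.logical_and(pred_c, gt_c).sum()
        ious.append(inter / union)
    return float(np.mean(ious)) if ious else float("nan")


def f_score(pred_occ, gt_occ):
    """Harmonic mean of precision and recall on boolean occupancy arrays."""
    tp = np.logical_and(pred_occ, gt_occ).sum()
    precision = tp / max(pred_occ.sum(), 1)
    recall = tp / max(gt_occ.sum(), 1)
    if precision + recall == 0:
        return 0.0
    return float(2 * precision * recall / (precision + recall))


# Toy data (hypothetical): four voxels, two semantic classes.
pred_labels = np.array([0, 0, 1, 1])
gt_labels = np.array([0, 1, 1, 1])
miou = mean_iou(pred_labels, gt_labels, num_classes=2)

# Toy occupancy grids (hypothetical), flattened to 1D.
pred_occ = np.array([True, True, False, False])
gt_occ = np.array([True, False, False, False])
fs = f_score(pred_occ, gt_occ)
```

Skipping classes that are absent from both prediction and ground truth keeps them from contributing an undefined 0/0 term to the mean; benchmarks may handle this case differently.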
