AM2FNet: Attention-based Multiscale & Multi-modality Fused Network

Inferring the 3D geometry and 3D semantic label of each unit in a scene, including both visible surfaces and occluded parts, is an important problem in many robotic fields. In recent years, several studies have addressed segmenting and completing a 3D scene from 2D information. Most of them complete a scene from a single depth image. Compared with the depth image, the RGB image contains more color and contour features, which can help with semantic labeling. However, designing an effective strategy to fuse RGB and depth features remains a challenging issue. Our paper presents an attention-based multi-scale & multi-modality fused network, called AM2FNet, which includes six modules: a depth feature module, a color feature module, a 3D integration module for multi-modality feature fusion, a 3D refinement module for multi-scale feature fusion, attention modules, and a semantic mapping module. The integration module and the refinement module work together in 3D space to fuse color and depth features at the low, middle, and high levels in a top-down fashion. In addition, we use an attention module to efficiently bias input-related features. Experimental results show that our proposed network generates higher-quality semantic scene completion (SSC) and scene completion (SC) results, and outperforms state-of-the-art methods on the real NYU and synthetic NYUCAD datasets. The contributions of the individual modules are also illustrated.
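To make the idea of attention-weighted multi-modality fusion concrete, the following is a minimal sketch and not the AM2FNet implementation. It assumes an element-wise sum as the fusion step and a squeeze-and-excitation style channel gate as the attention step; the function names `channel_attention` and `fuse_modalities`, the weight matrices `w1` and `w2`, and the flat-list representation of a 3D feature volume are all hypothetical choices made for illustration.

```python
import math


def channel_attention(feat, w1, w2):
    """Squeeze-and-excitation style gate (an assumed stand-in for the attention module).

    feat: list of C channels, each a flat list of voxel features.
    w1:   hidden x C weight matrix (list of rows), hypothetical.
    w2:   C x hidden weight matrix (list of rows), hypothetical.
    Returns the re-weighted features and the per-channel gates.
    """
    # Squeeze: global average over all voxels of each channel.
    squeezed = [sum(channel) / len(channel) for channel in feat]
    # Excitation: linear layer + ReLU, then linear layer + sigmoid.
    hidden = [max(0.0, sum(w * s for w, s in zip(row, squeezed))) for row in w1]
    gates = [
        1.0 / (1.0 + math.exp(-sum(w * h for w, h in zip(row, hidden))))
        for row in w2
    ]
    # Scale every voxel of a channel by that channel's gate.
    out = [[g * v for v in channel] for g, channel in zip(gates, feat)]
    return out, gates


def fuse_modalities(depth_feat, color_feat, w1, w2):
    """Fuse depth and color features by element-wise sum (an assumption), then gate."""
    fused = [
        [d + c for d, c in zip(d_channel, c_channel)]
        for d_channel, c_channel in zip(depth_feat, color_feat)
    ]
    return channel_attention(fused, w1, w2)


# Toy input: 2 channels, 2 voxels each; zero weights give a neutral gate of 0.5.
depth_feat = [[1.0, 1.0], [2.0, 2.0]]
color_feat = [[1.0, 1.0], [0.0, 0.0]]
w1 = [[0.0, 0.0]]
w2 = [[0.0], [0.0]]
out, gates = fuse_modalities(depth_feat, color_feat, w1, w2)
```

In a trained network the gates would be learned so that channels carrying input-relevant information are emphasized; here the weights are fixed only to keep the sketch small.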
