RFBNet: Deep Multimodal Networks with Residual Fusion Blocks for RGB-D Semantic Segmentation

RGB-D semantic segmentation methods conventionally use two independent encoders to extract features from the RGB and depth data. However, there lacks an effective fusion mechanism to bridge the encoders, for the purpose of fully exploiting the complementary information from multiple modalities. This paper proposes a novel bottom-up interactive fusion structure to model the interdependencies between the encoders. The structure introduces an interaction stream to interconnect the encoders. The interaction stream not only progressively aggregates modality-specific features from the encoders but also computes complementary features for them. To instantiate this structure, the paper proposes a residual fusion block (RFB) to formulate the interdependences of the encoders. The RFB consists of two residual units and one fusion unit with gate mechanism. It learns complementary features for the modality-specific encoders and extracts modality-specific features as well as cross-modal features. Based on the RFB, the paper presents the deep multimodal networks for RGB-D semantic segmentation called RFBNet. The experiments on two datasets demonstrate the effectiveness of modeling the interdependencies and that the RFBNet achieved state-of-the-art performance.

[1]  Luis Herranz,et al.  Depth CNNs for RGB-D Scene Recognition: Learning from Scratch Better than Transferring from RGB-CNNs , 2017, AAAI.

[2]  Michael S. Bernstein,et al.  ImageNet Large Scale Visual Recognition Challenge , 2014, International Journal of Computer Vision.

[3]  Sven Behnke,et al.  Learning depth-sensitive conditional random fields for semantic segmentation of RGB-D images , 2014, 2014 IEEE International Conference on Robotics and Automation (ICRA).

[4]  Wolfram Burgard,et al.  Deep Multispectral Semantic Scene Understanding of Forested Environments Using Multimodal Fusion , 2016, ISER.

[5]  Mohan S. Kankanhalli,et al.  Multimodal fusion for multimedia analysis: a survey , 2010, Multimedia Systems.

[6]  Eduardo Romera,et al.  ERFNet: Efficient Residual Factorized ConvNet for Real-Time Semantic Segmentation , 2018, IEEE Transactions on Intelligent Transportation Systems.

[7]  Luc Van Gool,et al.  Fast Scene Understanding for Autonomous Driving , 2017, ArXiv.

[8]  Fei Luo,et al.  RedNet: Residual Encoder-Decoder Network for indoor RGB-D Semantic Segmentation , 2018, ArXiv.

[9]  Chunxiang Wang,et al.  3D Semantic Modelling With Label Correction For Extensive Outdoor Scene , 2019, 2019 IEEE Intelligent Vehicles Symposium (IV).

[10]  Seungyong Lee,et al.  RDFNet: RGB-D Multi-level Residual Feature Fusion for Indoor Semantic Segmentation , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[11]  Florent Lafarge,et al.  Pyramid scene parsing network in 3D: Improving semantic segmentation of point clouds with multi-scale contextual information , 2019, ISPRS Journal of Photogrammetry and Remote Sensing.

[12]  Bo Chen,et al.  MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications , 2017, ArXiv.

[13]  Gang Wang,et al.  Learning Common and Specific Features for RGB-D Semantic Segmentation with Deconvolutional Networks , 2016, ECCV.

[14]  Steven Lake Waslander,et al.  In Defense of Classical Image Processing: Fast Depth Completion on the CPU , 2018, 2018 15th Conference on Computer and Robot Vision (CRV).

[15]  Bastian Leibe,et al.  Dense 3D semantic mapping of indoor scenes from RGB-D images , 2014, 2014 IEEE International Conference on Robotics and Automation (ICRA).

[16]  George Papandreou,et al.  Encoder-Decoder with Atrous Separable Convolution for Semantic Image Segmentation , 2018, ECCV.

[17]  Antonio Criminisi,et al.  TextonBoost for Image Understanding: Multi-Class Object Recognition and Segmentation by Jointly Modeling Texture, Layout, and Context , 2007, International Journal of Computer Vision.

[18]  Lars Hammarstrand,et al.  Long-Term Visual Localization Using Semantically Segmented Images , 2018, 2018 IEEE International Conference on Robotics and Automation (ICRA).

[19]  Graham W. Taylor,et al.  Deep Multimodal Learning: A Survey on Recent Advances and Trends , 2017, IEEE Signal Processing Magazine.

[20]  Daniel Cremers,et al.  FuseNet: Incorporating Depth into Semantic Segmentation via Fusion-Based CNN Architecture , 2016, ACCV.

[21]  Luc Van Gool,et al.  The Pascal Visual Object Classes Challenge: A Retrospective , 2014, International Journal of Computer Vision.

[22]  Ming Yang,et al.  Restricted Deformable Convolution-Based Road Scene Semantic Segmentation Using Surround View Cameras , 2018, IEEE Transactions on Intelligent Transportation Systems.

[23]  Camille Couprie,et al.  Learning Hierarchical Features for Scene Labeling , 2013, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[24]  Lei Han,et al.  GFF: Gated Fully Fusion for Semantic Segmentation , 2019, ArXiv.

[25]  Xin Zhao,et al.  Locality-Sensitive Deconvolution Networks with Gated Fusion for RGB-D Indoor Semantic Segmentation , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[26]  Jian Sun,et al.  Deep Residual Learning for Image Recognition , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[27]  Yann LeCun,et al.  Indoor Semantic Segmentation using depth information , 2013, ICLR.

[28]  Jürgen Schmidhuber,et al.  Training Very Deep Networks , 2015, NIPS.

[29]  Sebastian Ramos,et al.  The Cityscapes Dataset for Semantic Urban Scene Understanding , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[30]  Guodong Zhang,et al.  Nonnegative Matrix Cofactorization for Weakly Supervised Image Parsing , 2016, IEEE Signal Processing Letters.

[31]  Kaiwei Wang,et al.  Can we PASS beyond the Field of View? Panoramic Annular Semantic Segmentation for Real-World Surrounding Perception , 2019, 2019 IEEE Intelligent Vehicles Symposium (IV).

[32]  Jitendra Malik,et al.  Learning Rich Features from RGB-D Images for Object Detection and Segmentation , 2014, ECCV.

[33]  Matthias Nießner,et al.  ScanNet: Richly-Annotated 3D Reconstructions of Indoor Scenes , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[34]  Wolfram Burgard,et al.  Self-Supervised Model Adaptation for Multimodal Semantic Segmentation , 2018, International Journal of Computer Vision.

[35]  Cyrill Stachniss,et al.  Real-Time Semantic Segmentation of Crop and Weed for Precision Agriculture Robots Leveraging Background Knowledge in CNNs , 2017, 2018 IEEE International Conference on Robotics and Automation (ICRA).

[36]  Matthias Nießner,et al.  3DMV: Joint 3D-Multi-View Prediction for 3D Semantic Scene Segmentation , 2018, ECCV.

[37]  Hao Li,et al.  Semantic Segmentation-Based Lane-Level Localization Using Around View Monitoring System , 2019, IEEE Sensors Journal.

[38]  Jürgen Schmidhuber,et al.  Long Short-Term Memory , 1997, Neural Computation.

[39]  Ming Yang,et al.  CNN based semantic segmentation for urban traffic scenes using fisheye camera , 2017, 2017 IEEE Intelligent Vehicles Symposium (IV).

[40]  Juhan Nam,et al.  Multimodal Deep Learning , 2011, ICML.

[41]  Shuo Wang,et al.  Weakly Supervised Semantic Segmentation with a Multiscale Model , 2015, IEEE Signal Processing Letters.