In this paper, we study the problem of object detection and segmentation in cluttered indoor scenes based on RGB-D data. The main challenges of object detection and segmentation in indoor scenes stem from severe occlusion, inconspicuous classes, and easily confused categories. To address these problems, we propose a multimodal fusion deep convolutional neural network (MFDCNN) framework for object detection and segmentation that boosts performance effectively at two levels while keeping the framework end-to-end trainable. For object detection, we adopt a multimodal region proposal network to handle object-level detection; for semantic segmentation, we utilize a multimodal fully convolutional network to predict the class label of each pixel. Because we focus on learning object detection and segmentation simultaneously, we propose a novel loss function that combines these two networks. Under this framework, we target cluttered indoor scenes with challenging settings and evaluate the performance of our MFDCNN on the NYU-Depth V2 dataset. Our MFDCNN achieves state-of-the-art performance on the object detection task and comparable state-of-the-art performance on the semantic segmentation task.
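The abstract does not give the exact form of the multimodal fusion or of the joint objective, so the following is only a minimal sketch of the general idea: two encoder streams (RGB and depth) whose features are fused by concatenation, a segmentation head over the fused features, and a weighted sum of a detection loss and a per-pixel segmentation loss. All module names, channel sizes, and the fusion-by-concatenation choice are assumptions for illustration, not the authors' architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultimodalFusionSketch(nn.Module):
    """Illustrative two-stream RGB-D fusion: separate encoders whose
    features are concatenated before a task-specific head.
    (Hypothetical stand-in, not the MFDCNN architecture.)"""
    def __init__(self, num_classes=40):
        super().__init__()
        # Lightweight placeholder encoders for the RGB and depth streams.
        self.rgb_encoder = nn.Sequential(nn.Conv2d(3, 64, 3, padding=1), nn.ReLU())
        self.depth_encoder = nn.Sequential(nn.Conv2d(1, 64, 3, padding=1), nn.ReLU())
        # Pixel-wise segmentation head over the fused (concatenated) features.
        self.seg_head = nn.Conv2d(128, num_classes, 1)

    def forward(self, rgb, depth):
        fused = torch.cat([self.rgb_encoder(rgb), self.depth_encoder(depth)], dim=1)
        return self.seg_head(fused)

def joint_loss(det_loss, seg_logits, seg_labels, seg_weight=1.0):
    """Assumed form of a joint objective: detection loss (computed by the
    detection branch elsewhere) plus a weighted per-pixel cross-entropy."""
    seg_loss = F.cross_entropy(seg_logits, seg_labels)
    return det_loss + seg_weight * seg_loss
```

In practice, training both branches against a single weighted objective like this is what allows the detection and segmentation networks to share gradients and be optimized end-to-end.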