Recurrent convolutional neural networks for object-class segmentation of RGB-D video

Object-class segmentation is a computer vision task which requires labeling each pixel of an image with the class of the object it belongs to. Deep convolutional neural networks (DNN) are able to learn and exploit local spatial correlations required for this task. They are, however, restricted by their small, fixed-sized filters, which limits their ability to learn long-range dependencies. Recurrent Neural Networks (RNN), on the other hand, do not suffer from this restriction. Their iterative interpretation allows them to model long-range dependencies by propagating activity. This property might be especially useful when labeling video sequences, where both spatial and temporal long-range dependencies occur. In this work, we propose novel RNN architectures for object-class segmentation. We investigate three ways to consider past and future context in the prediction process by comparing networks that process the frames one by one with networks that have access to the whole sequence. We evaluate our models on the challenging NYU Depth v2 dataset for object-class segmentation and obtain competitive results.

[1]  Ronan Collobert,et al.  Recurrent Convolutional Neural Networks for Scene Labeling , 2014, ICML.

[2]  Dani Lischinski,et al.  Colorization using optimization , 2004, ACM Trans. Graph..

[3]  Alex Graves,et al.  Supervised Sequence Labelling with Recurrent Neural Networks , 2012, Studies in Computational Intelligence.

[4]  Fei-Fei Li,et al.  Large-Scale Video Classification with Convolutional Neural Networks , 2014, 2014 IEEE Conference on Computer Vision and Pattern Recognition.

[5]  Sven Behnke,et al.  Depth and height aware semantic RGB-D perception with convolutional neural networks , 2015, ESANN.

[6]  Geoffrey E. Hinton,et al.  Speech recognition with deep recurrent neural networks , 2013, 2013 IEEE International Conference on Acoustics, Speech and Signal Processing.

[7]  Jörg Stückler,et al.  Dense real-time mapping of object-class semantics from RGB-D video , 2013, Journal of Real-Time Image Processing.

[8]  Quoc V. Le,et al.  Learning hierarchical invariant spatio-temporal features for action recognition with independent subspace analysis , 2011, CVPR 2011.

[9]  Nitish Srivastava,et al.  Improving neural networks by preventing co-adaptation of feature detectors , 2012, ArXiv.

[10]  Christopher Kermorvant,et al.  Dropout Improves Recurrent Neural Networks for Handwriting Recognition , 2013, 2014 14th International Conference on Frontiers in Handwriting Recognition.

[11]  Sven Behnke,et al.  Learning depth-sensitive conditional random fields for semantic segmentation of RGB-D images , 2014, 2014 IEEE International Conference on Robotics and Automation (ICRA).

[12]  Sven Behnke,et al.  Hierarchical Neural Networks for Image Interpretation , 2003, Lecture Notes in Computer Science.

[13]  Derek Hoiem,et al.  Indoor Segmentation and Support Inference from RGBD Images , 2012, ECCV.

[14]  Roland Memisevic,et al.  Modeling Deep Temporal Dependencies with Recurrent "Grammar Cells" , 2014, NIPS.

[15]  Harm de Vries,et al.  RMSProp and equilibrated adaptive learning rates for non-convex optimization. , 2015 .

[16]  Yann LeCun,et al.  Indoor Semantic Segmentation using depth information , 2013, ICLR.

[17]  Sven Behnke,et al.  Fast Semantic Segmentation of RGB-D Scenes with GPU-Accelerated Deep Neural Networks , 2014, KI.

[18]  Yoshua Bengio,et al.  Gradient-based learning applied to document recognition , 1998, Proc. IEEE.

[19]  Jun Tani,et al.  Multiple spatio-temporal scales neural network for contextual visual recognition of human actions , 2014, 4th International Conference on Development and Learning and on Epigenetic Robotics.

[20]  Razvan Pascanu,et al.  On the difficulty of training recurrent neural networks , 2012, ICML.

[21]  Hermann Ney,et al.  LSTM Neural Networks for Language Modeling , 2012, INTERSPEECH.

[22]  Sven Behnke,et al.  Hierarchical Neural Networks for Image Interpretation (Lecture Notes in Computer Science) , 2003 .

[23]  Andrew Zisserman,et al.  Two-Stream Convolutional Networks for Action Recognition in Videos , 2014, NIPS.

[24]  Yann LeCun,et al.  Convolutional Learning of Spatio-temporal Features , 2010, ECCV.