ReSeg: A Recurrent Neural Network-Based Model for Semantic Segmentation

We propose a structured prediction architecture, which exploits the local generic features extracted by Convolutional Neural Networks and the capacity of Recurrent Neural Networks (RNN) to retrieve distant dependencies. The proposed architecture, called ReSeg, is based on the recently introduced ReNet model for image classification. We modify and extend it to perform the more challenging task of semantic segmentation. Each ReNet layer is composed of four RNN that sweep the image horizontally and vertically in both directions, encoding patches or activations, and providing relevant global information. Moreover, ReNet layers are stacked on top of pre-trained convolutional layers, benefiting from generic local features. Upsampling layers follow ReNet layers to recover the original image resolution in the final predictions. The proposed ReSeg architecture is efficient, flexible and suitable for a variety of semantic segmentation tasks. We evaluate ReSeg on several widely-used semantic segmentation datasets: Weizmann Horse, Oxford Flower, and CamVid, achieving stateof-the-art performance. Results show that ReSeg can act as a suitable architecture for semantic segmentation tasks, and may have further applications in other structured prediction problems. The source code and model hyperparameters are available on https://github.com/fvisin/reseg.

[1]  Ying Zhang,et al.  Batch normalized recurrent neural networks , 2015, 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[2]  Philip H. S. Torr,et al.  What, Where and How Many? Combining Object Detectors and CRFs , 2010, ECCV.

[3]  Roberto Cipolla,et al.  Semantic object classes in video: A high-definition ground truth database , 2009, Pattern Recognit. Lett..

[4]  Jana Kosecka,et al.  Nonparametric Scene Parsing with Adaptive Feature Relevance and Semantic Context , 2013, 2013 IEEE Conference on Computer Vision and Pattern Recognition.

[5]  Jürgen Schmidhuber,et al.  Parallel Multi-Dimensional LSTM, With Application to Fast Biomedical Volumetric Image Segmentation , 2015, NIPS.

[6]  Yizhou Yu,et al.  Combining the Best of Convolutional Layers and Recurrent Layers: A Hybrid Network for Semantic Segmentation , 2016, ArXiv.

[7]  Francesco Visin,et al.  A guide to convolution arithmetic for deep learning , 2016, ArXiv.

[8]  Dumitru Erhan,et al.  Scalable Object Detection Using Deep Neural Networks , 2013, 2014 IEEE Conference on Computer Vision and Pattern Recognition.

[9]  Shimon Ullman,et al.  Combining Top-Down and Bottom-Up Segmentation , 2004, 2004 Conference on Computer Vision and Pattern Recognition Workshop.

[10]  Soumith Chintala,et al.  Unsupervised Representation Learning with Deep Convolutional Generative Adversarial Networks , 2015, ICLR.

[11]  Camille Couprie,et al.  Learning Hierarchical Features for Scene Labeling , 2013, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[12]  Yoshua Bengio,et al.  ReNet: A Recurrent Neural Network Based Alternative to Convolutional Networks , 2015, ArXiv.

[13]  Geoffrey E. Hinton,et al.  ImageNet classification with deep convolutional neural networks , 2012, Commun. ACM.

[14]  Qiang Chen,et al.  Network In Network , 2013, ICLR.

[15]  Razvan Pascanu,et al.  Theano: new features and speed improvements , 2012, ArXiv.

[16]  Joost van de Weijer,et al.  Unrolling Loopy Top-Down Semantic Feedback in Convolutional Deep Networks , 2014, 2014 IEEE Conference on Computer Vision and Pattern Recognition Workshops.

[17]  Ming-Hsuan Yang,et al.  PatchCut: Data-driven object segmentation via local shape transfer , 2015, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[18]  Rob Fergus,et al.  Predicting Depth, Surface Normals and Semantic Labels with a Common Multi-scale Convolutional Architecture , 2014, 2015 IEEE International Conference on Computer Vision (ICCV).

[19]  Guosheng Lin,et al.  CRF Learning with CNN Features for Image Segmentation , 2015, Pattern Recognit..

[20]  Trevor Darrell,et al.  Fully Convolutional Networks for Semantic Segmentation , 2017, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[21]  Yoshua Bengio,et al.  Understanding the difficulty of training deep feedforward neural networks , 2010, AISTATS.

[22]  Jürgen Schmidhuber,et al.  Long Short-Term Memory , 1997, Neural Computation.

[23]  Andrew Blake,et al.  "GrabCut": interactive foreground extraction using iterated graph cuts , 2004, ACM Trans. Graph..

[24]  Andrew Zisserman,et al.  Very Deep Convolutional Networks for Large-Scale Image Recognition , 2014, ICLR.

[25]  Ronan Collobert,et al.  Recurrent Convolutional Neural Networks for Scene Labeling , 2014, ICML.

[26]  Philip H. S. Torr,et al.  Combining Appearance and Structure from Motion Features for Road Scene Understanding , 2009, BMVC.

[27]  Andrew Zisserman,et al.  A Visual Vocabulary for Flower Classification , 2006, 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'06).

[28]  Graham W. Taylor,et al.  Adaptive deconvolutional networks for mid and high level feature learning , 2011, 2011 International Conference on Computer Vision.

[29]  Kunio Kashino,et al.  Tri-Map Self-Validation Based on Least Gibbs Energy for Foreground Segmentation , 2014, BMVC.

[30]  Roberto Cipolla,et al.  Bayesian SegNet: Model Uncertainty in Deep Convolutional Encoder-Decoder Architectures for Scene Understanding , 2015, BMVC.

[31]  Dumitru Erhan,et al.  Going deeper with convolutions , 2014, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[32]  Jitendra Malik,et al.  Hypercolumns for object segmentation and fine-grained localization , 2014, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[33]  C. V. Jawahar,et al.  Scene Text Recognition using Higher Order Language Priors , 2009, BMVC.

[34]  Seunghoon Hong,et al.  Learning Deconvolution Network for Semantic Segmentation , 2015, 2015 IEEE International Conference on Computer Vision (ICCV).

[35]  Roberto Cipolla,et al.  SegNet: A Deep Convolutional Encoder-Decoder Architecture for Image Segmentation , 2015, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[36]  Thomas Brox,et al.  U-Net: Convolutional Networks for Biomedical Image Segmentation , 2015, MICCAI.

[37]  Anders Krogh,et al.  A Simple Weight Decay Can Improve Generalization , 1991, NIPS.

[38]  Jonathan T. Barron,et al.  Semantic Image Segmentation with Task-Specific Edge Detection Using CNNs and a Discriminatively Trained Domain Transform , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[39]  Alistair A. Young,et al.  Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) , 2017, MICCAI 2017.

[40]  Hui Jiang,et al.  Generating images with recurrent adversarial networks , 2016, ArXiv.

[41]  Tianli Yu,et al.  Kernelized structural SVM learning for supervised object segmentation , 2011, CVPR 2011.

[42]  Li Fei-Fei,et al.  ImageNet: A large-scale hierarchical image database , 2009, CVPR.

[43]  Yoshua Bengio,et al.  Learning Phrase Representations using RNN Encoder–Decoder for Statistical Machine Translation , 2014, EMNLP.

[44]  Svetlana Lazebnik,et al.  Superparsing - Scalable Nonparametric Image Parsing with Superpixels , 2010, Int. J. Comput. Vis..

[45]  Surya Ganguli,et al.  Exact solutions to the nonlinear dynamics of learning in deep linear neural networks , 2013, ICLR.

[46]  Rob Fergus,et al.  Visualizing and Understanding Convolutional Networks , 2013, ECCV.

[47]  Matthew D. Zeiler ADADELTA: An Adaptive Learning Rate Method , 2012, ArXiv.

[48]  Roberto Cipolla,et al.  SegNet: A Deep Convolutional Encoder-Decoder Architecture for Robust Semantic Pixel-Wise Labelling , 2015, CVPR 2015.

[49]  Colin Raffel,et al.  Lasagne: First release. , 2015 .

[50]  Marcus Liwicki,et al.  Scene labeling with LSTM recurrent neural networks , 2015, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).