Learning Hierarchical Features for Scene Labeling

Scene labeling consists of labeling each pixel in an image with the category of the object it belongs to. We propose a method that uses a multiscale convolutional network trained from raw pixels to extract dense feature vectors that encode regions of multiple sizes centered on each pixel. The method alleviates the need for engineered features, and produces a powerful representation that captures texture, shape, and contextual information. We report results using multiple postprocessing methods to produce the final labeling. Among those, we propose a technique to automatically retrieve, from a pool of segmentation components, an optimal set of components that best explain the scene; these components are arbitrary, for example, they can be taken from a segmentation tree or from any family of oversegmentations. The system yields record accuracies on the SIFT Flow dataset (33 classes) and the Barcelona dataset (170 classes) and near-record accuracy on Stanford background dataset (eight classes), while being an order of magnitude faster than competing approaches, producing a 320×240 image labeling in less than a second, including feature extraction.

[1]  D. R. Fulkerson,et al.  A Simple Algorithm for Finding Maximal Network Flows and an Application to the Hitchcock Problem , 1957, Canadian Journal of Mathematics.

[2]  Lawrence D. Jackel,et al.  Handwritten Digit Recognition with a Back-Propagation Network , 1989, NIPS.

[3]  R. Vaillant,et al.  An original approach for the localization of objects in images , 1993 .

[4]  R. Vaillant,et al.  Original approach for the localisation of objects in images , 1994 .

[5]  Laurent Najman,et al.  Geodesic Saliency of Watershed Contours and Hierarchical Segmentation , 1996, IEEE Trans. Pattern Anal. Mach. Intell..

[6]  Yoshua Bengio,et al.  Gradient-based learning applied to document recognition , 1998, Proc. IEEE.

[7]  Olga Veksler,et al.  Fast approximate energy minimization via graph cuts , 2001, Proceedings of the Seventh IEEE International Conference on Computer Vision.

[8]  Y.Y. Boykov,et al.  Interactive graph cuts for optimal boundary & region segmentation of objects in N-D images , 2001, Proceedings Eighth IEEE International Conference on Computer Vision. ICCV 2001.

[9]  J. van Leeuwen,et al.  Neural Networks: Tricks of the Trade , 2002, Lecture Notes in Computer Science.

[10]  Patrice Y. Simard,et al.  Best practices for convolutional neural networks applied to visual document analysis , 2003, Seventh International Conference on Document Analysis and Recognition, 2003. Proceedings..

[11]  Christophe Garcia,et al.  Convolutional face finder: a neural architecture for fast and robust face detection , 2004, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[12]  Yann LeCun,et al.  Synergistic Face Detection and Pose Estimation with Energy-Based Models , 2004, J. Mach. Learn. Res..

[13]  Vladimir Kolmogorov,et al.  An experimental comparison of min-cut/max- flow algorithms for energy minimization in vision , 2001, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[14]  Yann LeCun,et al.  Toward automatic phenotyping of developing embryos from videos , 2005, IEEE Transactions on Image Processing.

[15]  Yann LeCun,et al.  Dimensionality Reduction by Learning an Invariant Mapping , 2006, 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'06).

[16]  Antonio Criminisi,et al.  TextonBoost: Joint Appearance, Shape and Context Modeling for Multi-class Object Recognition and Segmentation , 2006, ECCV.

[17]  Antonio Torralba,et al.  Object Recognition by Scene Alignment , 2007, NIPS.

[18]  Marc'Aurelio Ranzato,et al.  Unsupervised Learning of Invariant Feature Hierarchies with Applications to Object Recognition , 2007, 2007 IEEE Conference on Computer Vision and Pattern Recognition.

[19]  Joseph F. Murray,et al.  Supervised Learning of Image Restoration with Convolutional Networks , 2007, 2007 IEEE 11th International Conference on Computer Vision.

[20]  Richard S. Zemel,et al.  Learning Hybrid Models for Image Annotation with Partially Labeled Data , 2008, NIPS.

[21]  Cordelia Schmid,et al.  Object Recognition by Integrating Multiple Image Segmentations , 2008, ECCV.

[22]  Stephen Gould,et al.  Multi-Class Segmentation with Relative Location Prior , 2008, International Journal of Computer Vision.

[23]  H. Sebastian Seung,et al.  Maximin affinity learning of image segmentation , 2009, NIPS.

[24]  Honglak Lee,et al.  Convolutional deep belief networks for scalable unsupervised learning of hierarchical representations , 2009, ICML '09.

[25]  L. Bottou,et al.  Deep Convolutional Networks for Scene Parsing , 2009 .

[26]  Yann LeCun,et al.  What is the best multi-stage architecture for object recognition? , 2009, 2009 IEEE 12th International Conference on Computer Vision.

[27]  Stefano Soatto,et al.  Class segmentation and object localization with superpixel neighborhoods , 2009, 2009 IEEE 12th International Conference on Computer Vision.

[28]  R. Fergus,et al.  Learning invariant features through topographic filter maps , 2009, 2009 IEEE Conference on Computer Vision and Pattern Recognition.

[29]  Pushmeet Kohli,et al.  Associative hierarchical CRFs for object class image segmentation , 2009, 2009 IEEE 12th International Conference on Computer Vision.

[30]  Stephen Gould,et al.  Decomposing a scene into geometric and semantically consistent regions , 2009, 2009 IEEE 12th International Conference on Computer Vision.

[31]  Antonio Torralba,et al.  Nonparametric scene parsing: Label transfer via dense scene alignment , 2009, 2009 IEEE Conference on Computer Vision and Pattern Recognition.

[32]  Daphne Koller,et al.  Efficiently selecting regions for scene understanding , 2010, 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition.

[33]  Martial Hebert,et al.  Stacked Hierarchical Labeling , 2010, ECCV.

[34]  Y-Lan Boureau,et al.  Learning Convolutional Feature Hierarchies for Visual Recognition , 2010, NIPS.

[35]  Svetlana Lazebnik,et al.  Superparsing - Scalable Nonparametric Image Parsing with Superpixels , 2010, Int. J. Comput. Vis..

[36]  Berin Martini,et al.  Hardware accelerated convolutional neural networks for synthetic vision systems , 2010, Proceedings of 2010 IEEE International Symposium on Circuits and Systems.

[37]  Marc'Aurelio Ranzato,et al.  Fast Inference in Sparse Coding Algorithms with Applications to Object Recognition , 2010, ArXiv.

[38]  Alexei A. Efros,et al.  Unbiased look at dataset bias , 2011, CVPR 2011.

[39]  Charless C. Fowlkes,et al.  Contour Detection and Hierarchical Image Segmentation , 2011, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[40]  Berin Martini,et al.  NeuFlow: A runtime reconfigurable dataflow processor for vision , 2011, CVPR 2011 WORKSHOPS.

[41]  Andrew Y. Ng,et al.  Parsing Natural Scenes and Natural Language with Recursive Neural Networks , 2011, ICML.

[42]  Andrew Zisserman,et al.  Pylon Model for Semantic Segmentation , 2011, NIPS.

[43]  Jürgen Schmidhuber,et al.  A committee of neural networks for traffic sign classification , 2011, The 2011 International Joint Conference on Neural Networks.

[44]  Yann LeCun,et al.  Scene parsing with Multiscale Feature Learning, Purity Trees, and Optimal Covers , 2012, ICML.

[45]  Sven Behnke,et al.  Learning Object-Class Segmentation with Convolutional Neural Networks , 2012, ESANN.

[46]  Klaus-Robert Müller,et al.  Efficient BackProp , 2012, Neural Networks: Tricks of the Trade.

[47]  Cristian Sminchisescu,et al.  CPMC: Automatic Object Segmentation Using Constrained Parametric Min-Cuts , 2012, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[48]  Svetlana Lazebnik,et al.  Superparsing - Scalable Nonparametric Image Parsing with Superpixels , 2010, International Journal of Computer Vision.

[49]  智一 吉田,et al.  Efficient Graph-Based Image Segmentationを用いた圃場図自動作成手法の検討 , 2014 .