Deep Deconvolutional Networks for Scene Parsing

Scene parsing is an important and challenging prob- lem in computer vision. It requires labeling each pixel in an image with the category it belongs to. Tradition- ally, it has been approached with hand-engineered features from color information in images. Recently convolutional neural networks (CNNs), which automatically learn hierar- chies of features, have achieved record performance on the task. These approaches typically include a post-processing technique, such as superpixels, to produce the final label- ing. In this paper, we propose a novel network architecture that combines deep deconvolutional neural networks with CNNs. Our experiments show that deconvolutional neu- ral networks are capable of learning higher order image structure beyond edge primitives in comparison to CNNs. The new network architecture is employed for multi-patch training, introduced as part of this work. Multi-patch train- ing makes it possible to effectively learn spatial priors from scenes. The proposed approach yields state-of-the-art per- formance on four scene parsing datasets, namely Stanford Background, SIFT Flow, CamVid, and KITTI. In addition, our system has the added advantage of having a training system that can be completely automated end-to-end with- out requiring any post-processing.

[1]  R. Vaillant,et al.  Original approach for the localisation of objects in images , 1994 .

[2]  Yoshua Bengio,et al.  Greedy Layer-Wise Training of Deep Networks , 2006, NIPS.

[3]  Yee Whye Teh,et al.  A Fast Learning Algorithm for Deep Belief Nets , 2006, Neural Computation.

[4]  Thomas Hofmann,et al.  Efficient Learning of Sparse Representations with an Energy-Based Model , 2007 .

[5]  Marc'Aurelio Ranzato,et al.  Sparse Feature Learning for Deep Belief Networks , 2007, NIPS.

[6]  Bill Triggs,et al.  Scene Segmentation with CRFs Learned from Partially Labeled Images , 2007, NIPS.

[7]  George Loizou,et al.  Computer vision and pattern recognition , 2007, Int. J. Comput. Math..

[8]  Marc'Aurelio Ranzato,et al.  Unsupervised Learning of Invariant Feature Hierarchies with Applications to Object Recognition , 2007, 2007 IEEE Conference on Computer Vision and Pattern Recognition.

[9]  Roberto Cipolla,et al.  Segmentation and Recognition Using Structure from Motion Point Clouds , 2008, ECCV.

[10]  Cordelia Schmid,et al.  Object Recognition by Integrating Multiple Image Segmentations , 2008, ECCV.

[11]  Yoshua Bengio,et al.  Extracting and composing robust features with denoising autoencoders , 2008, ICML '08.

[12]  Antonio Criminisi,et al.  Object Class Segmentation using Random Forests , 2008, BMVC.

[13]  Guillermo Sapiro,et al.  Supervised Dictionary Learning , 2008, NIPS.

[14]  Philip H. S. Torr,et al.  Combining Appearance and Structure from Motion Features for Road Scene Understanding , 2009, BMVC.

[15]  Honglak Lee,et al.  Convolutional deep belief networks for scalable unsupervised learning of hierarchical representations , 2009, ICML '09.

[16]  Yann LeCun,et al.  What is the best multi-stage architecture for object recognition? , 2009, 2009 IEEE 12th International Conference on Computer Vision.

[17]  R. Fergus,et al.  Learning invariant features through topographic filter maps , 2009, 2009 IEEE Conference on Computer Vision and Pattern Recognition.

[18]  Pushmeet Kohli,et al.  Associative hierarchical CRFs for object class image segmentation , 2009, 2009 IEEE 12th International Conference on Computer Vision.

[19]  Stephen Gould,et al.  Decomposing a scene into geometric and semantically consistent regions , 2009, 2009 IEEE 12th International Conference on Computer Vision.

[20]  Philip H. S. Torr,et al.  What, Where and How Many? Combining Object Detectors and CRFs , 2010, ECCV.

[21]  Yoshua Bengio,et al.  Why Does Unsupervised Pre-training Help Deep Learning? , 2010, AISTATS.

[22]  Martial Hebert,et al.  Stacked Hierarchical Labeling , 2010, ECCV.

[23]  Svetlana Lazebnik,et al.  Superparsing , 2010, International Journal of Computer Vision.

[24]  Yann LeCun,et al.  Convolutional networks and applications in vision , 2010, Proceedings of 2010 IEEE International Symposium on Circuits and Systems.

[25]  Charless C. Fowlkes,et al.  Contour Detection and Hierarchical Image Segmentation , 2011, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[26]  Bastian Leibe,et al.  Multi-Class Image Labeling with Top-Down Segmentation and Generalized Robust $P^N$ Potentials , 2011, BMVC.

[27]  Andrew Y. Ng,et al.  Parsing Natural Scenes and Natural Language with Recursive Neural Networks , 2011, ICML.

[28]  Antonio Torralba,et al.  SIFT Flow: Dense Correspondence across Scenes and Its Applications , 2011, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[29]  Honglak Lee,et al.  An Analysis of Single-Layer Networks in Unsupervised Feature Learning , 2011, AISTATS.

[30]  Graham W. Taylor,et al.  Adaptive deconvolutional networks for mid and high level feature learning , 2011, 2011 International Conference on Computer Vision.

[31]  Yann LeCun,et al.  Scene parsing with Multiscale Feature Learning, Purity Trees, and Optimal Covers , 2012, ICML.

[32]  Franz Kummert,et al.  Spatial ray features for real-time ego-lane extraction , 2012, 2012 15th International IEEE Conference on Intelligent Transportation Systems.

[33]  C. V. Jawahar,et al.  Scene Text Recognition using Higher Order Language Priors , 2009, BMVC.

[34]  Andreas Geiger,et al.  Are we ready for autonomous driving? The KITTI vision benchmark suite , 2012, 2012 IEEE Conference on Computer Vision and Pattern Recognition.

[35]  Camille Couprie,et al.  Learning Hierarchical Features for Scene Labeling , 2013, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[36]  Svetlana Lazebnik,et al.  Finding Things: Image Parsing with Regions and Per-Exemplar Detectors , 2013, 2013 IEEE Conference on Computer Vision and Pattern Recognition.

[37]  Giovani Bernardes Vitor,et al.  A probabilistic distribution approach for the classification of urban roads in complex environments , 2014 .

[38]  Ronan Collobert,et al.  Recurrent Convolutional Neural Networks for Scene Labeling , 2014, ICML.

[39]  Nitish Srivastava,et al.  Dropout: a simple way to prevent neural networks from overfitting , 2014, J. Mach. Learn. Res..

[40]  Denis Fernando Wolf,et al.  Road terrain detection: Avoiding common obstacle detection assumptions using sensor fusion , 2014, 2014 IEEE Intelligent Vehicles Symposium Proceedings.