Learning Hierarchical Feature Extractors For Image Recognition

Telling a cow from a sheep is effortless for most animals, but requires much engineering for computers. In this thesis, we seek to tease out basic principles that underlie many recent advances in image recognition. First, we recast many methods into a common unsupervised feature extraction framework based on an alternation of coding steps, which encode the input by comparing it with a collection of reference patterns, and pooling steps, which compute an aggregation statistic summarizing the codes within some region of interest of the image. Within that framework, we conduct extensive comparative evaluations of many coding and pooling operators proposed in the literature. Our results demonstrate a robust superiority of sparse coding (which decomposes an input as a linear combination of a few visual words) and max pooling (which summarizes a set of inputs by their maximum value). We also propose macrofeatures, which import into the popular spatial pyramid framework the joint encoding of nearby features commonly practiced in neural networks, and obtain significantly improved image recognition performance. Next, we analyze the statistical properties of max pooling that underlie its better performance, through a simple theoretical model of feature activation. We then present results of experiments that confirm many predictions of the model. Beyond the pooling operator itself, an important parameter is the set of pools over which the summary statistic is computed. We propose locality in feature configuration space as a natural criterion for devising better pools. Finally, we propose ways to make coding faster and more powerful through fast convolutional feedforward architectures, and examine how to incorporate supervision into feature extraction schemes. Overall, our experiments offer insights into what makes current systems work so well, and yield state-of-the-art results on several image recognition benchmarks.
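The coding and pooling steps described above can be illustrated with a minimal sketch. It is not the implementation used in the thesis: it assumes local descriptors stored as rows of a matrix `X`, a dictionary `D` of visual words stored as unit-norm rows, sparse coding solved with plain ISTA (a generic solver chosen here for brevity), and pooling over the raw code values of one region. All names (`sparse_code`, `pool`, `lam`, `n_iter`) and the random data are hypothetical.

```python
import numpy as np


def soft_threshold(v, t):
    """Elementwise shrinkage operator used by ISTA."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)


def sparse_code(X, D, lam=0.1, n_iter=100):
    """Coding step (sketch): approximately minimize, for each row x of X,
    0.5 * ||x - z D||^2 + lam * ||z||_1 over the code z, using ISTA.

    X: (n_descriptors, dim) array; D: (n_words, dim) array of visual words.
    """
    # Step size from the largest eigenvalue of D D^T (Lipschitz constant).
    lipschitz = np.linalg.norm(D @ D.T, 2)
    Z = np.zeros((X.shape[0], D.shape[0]))
    for _ in range(n_iter):
        grad = (Z @ D - X) @ D.T
        Z = soft_threshold(Z - grad / lipschitz, lam / lipschitz)
    return Z


def pool(Z, kind="max"):
    """Pooling step (sketch): summarize all codes of one region,
    per visual word, by their maximum or their average."""
    if kind == "max":
        return Z.max(axis=0)
    return Z.mean(axis=0)


# Toy data standing in for the descriptors of one image region (assumption).
rng = np.random.default_rng(0)
D = rng.standard_normal((8, 5))
D /= np.linalg.norm(D, axis=1, keepdims=True)  # unit-norm visual words
X = rng.standard_normal((20, 5))

Z = sparse_code(X, D)            # one code per descriptor
pooled_max = pool(Z, "max")      # one value per visual word
pooled_avg = pool(Z, "average")
```

Each row of `Z` is the code of one descriptor, and each pooled vector has one entry per visual word, so a region is summarized by a fixed-length vector regardless of how many descriptors it contains. Swapping `sparse_code` for hard vector quantization, or `"max"` for `"average"`, gives other coding and pooling combinations of the kind the thesis compares.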
