Learning Shared, Discriminative, and Compact Representations for Visual Recognition

Dictionary-based and part-based methods are among the most popular approaches to visual recognition. In both methods, a mid-level representation is built on top of low-level image descriptors and high-level classifiers are trained on top of the mid-level representation. While earlier methods built the mid-level representation without supervision, there is currently great interest in learning both representations jointly to make the mid-level representation more discriminative. In this work we propose a new approach to visual recognition that jointly learns a shared, discriminative, and compact mid-level representation and a compact high-level representation. By using a structured output learning framework, our approach directly handles the multiclass case at both levels of abstraction. Moreover, by using a group-sparse prior in the structured output learning framework, our approach encourages sharing of visual words and thus reduces the number of words used to represent each class. We test our proposed method on several popular benchmarks. Our results show that, by jointly learning midand high-level representations, and fostering the sharing of discriminative visual words among target classes, we are able to achieve state-of-the-art recognition performance using far less visual words than previous approaches.

[1]  Shuicheng Yan,et al.  An HOG-LBP human detector with partial occlusion handling , 2009, 2009 IEEE 12th International Conference on Computer Vision.

[2]  Alan L. Yuille,et al.  The Concave-Convex Procedure , 2003, Neural Computation.

[3]  Frédéric Jurie,et al.  Fast Discriminative Visual Codebooks using Randomized Clustering Forests , 2006, NIPS.

[4]  Jean Ponce,et al.  Learning Discriminative Part Detectors for Image Classification and Cosegmentation , 2013, 2013 IEEE International Conference on Computer Vision.

[5]  Yi Yang,et al.  Articulated Human Detection with Flexible Mixtures of Parts , 2013, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[6]  Yoshua. Bengio,et al.  Learning Deep Architectures for AI , 2007, Found. Trends Mach. Learn..

[7]  H. Zou,et al.  Regularization and variable selection via the elastic net , 2005 .

[8]  Yang Wang,et al.  Hidden Part Models for Human Action Recognition: Probabilistic versus Max Margin , 2011, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[9]  Kristen Grauman,et al.  Sharing features between objects and their attributes , 2011, CVPR 2011.

[10]  Pedro F. Felzenszwalb,et al.  Reconfigurable models for scene recognition , 2012, 2012 IEEE Conference on Computer Vision and Pattern Recognition.

[11]  Jianping Fan,et al.  Jointly Learning Visually Correlated Dictionaries for Large-Scale Visual Recognition Applications , 2014, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[12]  Bill Triggs,et al.  Histograms of oriented gradients for human detection , 2005, 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'05).

[13]  Trevor Darrell,et al.  Beyond spatial pyramids: Receptive field learning for pooled image features , 2012, 2012 IEEE Conference on Computer Vision and Pattern Recognition.

[14]  Yihong Gong,et al.  Locality-constrained Linear Coding for image classification , 2010, 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition.

[15]  Yihong Gong,et al.  Linear spatial pyramid matching using sparse coding for image classification , 2009, 2009 IEEE Conference on Computer Vision and Pattern Recognition.

[16]  Gabriela Csurka,et al.  Visual categorization with bags of keypoints , 2002, eccv 2004.

[17]  Hao Su,et al.  Object Bank: A High-Level Image Representation for Scene Classification & Semantic Feature Sparsification , 2010, NIPS.

[18]  Nicholas Roy,et al.  Indoor scene recognition by a mobile robot through adaptive object detection , 2013, Robotics Auton. Syst..

[19]  I. Biederman Recognition-by-components: a theory of human image understanding. , 1987, Psychological review.

[20]  Fei-Fei Li,et al.  What, where and who? Classifying events by scene and object recognition , 2007, 2007 IEEE 11th International Conference on Computer Vision.

[21]  G LoweDavid,et al.  Distinctive Image Features from Scale-Invariant Keypoints , 2004 .

[22]  Antonio Criminisi,et al.  Object categorization by learned universal visual dictionary , 2005, Tenth IEEE International Conference on Computer Vision (ICCV'05) Volume 1.

[23]  Geoffrey E. Hinton,et al.  ImageNet classification with deep convolutional neural networks , 2012, Commun. ACM.

[24]  Sebastian Nowozin,et al.  On feature combination for multiclass object classification , 2009, 2009 IEEE 12th International Conference on Computer Vision.

[25]  Pietro Perona,et al.  Learning Generative Visual Models from Few Training Examples: An Incremental Bayesian Approach Tested on 101 Object Categories , 2004, 2004 Conference on Computer Vision and Pattern Recognition Workshop.

[26]  Daniel P. Huttenlocher,et al.  Pictorial Structures for Object Recognition , 2004, International Journal of Computer Vision.

[27]  Guillermo Sapiro,et al.  Supervised Dictionary Learning , 2008, NIPS.

[28]  Svetlana Lazebnik,et al.  Scene recognition and weakly supervised object localization with deformable part-based models , 2011, 2011 International Conference on Computer Vision.

[29]  Alexei A. Efros,et al.  Unsupervised Discovery of Mid-Level Discriminative Patches , 2012, ECCV.

[30]  René Vidal,et al.  Visual Dictionary Learning for Joint Object Categorization and Segmentation , 2012, ECCV.

[31]  Sven Behnke,et al.  Evaluation of Pooling Operations in Convolutional Architectures for Object Recognition , 2010, ICANN.

[32]  Thomas Hofmann,et al.  Support vector machine learning for interdependent and structured output spaces , 2004, ICML.

[33]  Jorge Nocedal,et al.  A Limited Memory Algorithm for Bound Constrained Optimization , 1995, SIAM J. Sci. Comput..

[34]  Zhiwei Li,et al.  Max-Margin Dictionary Learning for Multiclass Image Categorization , 2010, ECCV.

[35]  David A. McAllester,et al.  Object Detection with Discriminatively Trained Part Based Models , 2010, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[36]  Antonio Torralba,et al.  Recognizing indoor scenes , 2009, 2009 IEEE Conference on Computer Vision and Pattern Recognition.

[37]  Hervé Le Borgne,et al.  Locality-constrained and spatially regularized coding for scene categorization , 2012, 2012 IEEE Conference on Computer Vision and Pattern Recognition.

[38]  Jean Ponce,et al.  Learning mid-level features for recognition , 2010, 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition.

[39]  C. V. Jawahar,et al.  Blocks That Shout: Distinctive Parts for Scene Classification , 2013, 2013 IEEE Conference on Computer Vision and Pattern Recognition.

[40]  Antonio Torralba,et al.  Sharing features: efficient boosting procedures for multiclass object detection , 2004, Proceedings of the 2004 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2004. CVPR 2004..

[41]  Yee Whye Teh,et al.  A Fast Learning Algorithm for Deep Belief Nets , 2006, Neural Computation.

[42]  Subhransu Maji,et al.  Detecting People Using Mutually Consistent Poselet Activations , 2010, ECCV.

[43]  Paul A. Viola,et al.  Robust Real-Time Face Detection , 2001, International Journal of Computer Vision.

[44]  Fei-Fei Li,et al.  Large Margin Learning of Upstream Scene Understanding Models , 2010, NIPS.

[45]  P. Tseng Convergence of a Block Coordinate Descent Method for Nondifferentiable Minimization , 2001 .

[46]  Andrew Zisserman,et al.  Video Google: a text retrieval approach to object matching in videos , 2003, Proceedings Ninth IEEE International Conference on Computer Vision.

[47]  Alexei A. Efros,et al.  Ensemble of exemplar-SVMs for object detection and beyond , 2011, 2011 International Conference on Computer Vision.

[48]  Mark Everingham,et al.  Shared parts for deformable part-based models , 2011, CVPR 2011.

[49]  Frédéric Jurie,et al.  Creating efficient codebooks for visual recognition , 2005, Tenth IEEE International Conference on Computer Vision (ICCV'05) Volume 1.

[50]  Svetlana Lazebnik,et al.  Supervised Learning of Quantizer Codebooks by Information Loss Minimization , 2009, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[51]  Alexei A. Efros,et al.  Mid-level Visual Element Discovery as Discriminative Mode Seeking , 2013, NIPS.

[52]  S. V. N. Vishwanathan,et al.  A quasi-Newton approach to non-smooth convex optimization , 2008, ICML '08.

[53]  Cordelia Schmid,et al.  Beyond Bags of Features: Spatial Pyramid Matching for Recognizing Natural Scene Categories , 2006, 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'06).

[54]  William T. Freeman,et al.  Group Norm for Learning Structured SVMs with Unstructured Latent Variables , 2013, 2013 IEEE International Conference on Computer Vision.

[55]  René Vidal,et al.  Joint Dictionary and Classifier Learning for Categorization of Images Using a Max-margin Framework , 2013, PSIVT.

[56]  Steve Branson,et al.  Similarity metrics for categorization: From monolithic to category specific , 2009, 2009 IEEE 12th International Conference on Computer Vision.

[57]  René Vidal,et al.  Hierarchical Joint Max-Margin Learning of Mid and Top Level Representations for Visual Recognition , 2013, 2013 IEEE International Conference on Computer Vision.

[58]  Thorsten Joachims,et al.  Learning structural SVMs with latent variables , 2009, ICML '09.