Weakly Supervised Visual Dictionary Learning by Harnessing Image Attributes

Bag-of-features (BoFs) representation has been extensively applied to deal with various computer vision applications. To extract discriminative and descriptive BoF, one important step is to learn a good dictionary to minimize the quantization loss between local features and codewords. While most existing visual dictionary learning approaches are engaged with unsupervised feature quantization, the latest trend has turned to supervised learning by harnessing the semantic labels of images or regions. However, such labels are typically too expensive to acquire, which restricts the scalability of supervised dictionary learning approaches. In this paper, we propose to leverage image attributes to weakly supervise the dictionary learning procedure without requiring any actual labels. As a key contribution, our approach establishes a generative hidden Markov random field (HMRF), which models the quantized codewords as the observed states and the image attributes as the hidden states, respectively. Dictionary learning is then performed by supervised grouping the observed states, where the supervised information is stemmed from the hidden states of the HMRF. In such a way, the proposed dictionary learning approach incorporates the image attributes to learn a semantic-preserving BoF representation without any genuine supervision. Experiments in large-scale image retrieval and classification tasks corroborate that our approach significantly outperforms the state-of-the-art unsupervised dictionary learning approaches.

[1]  Frédéric Jurie,et al.  Fast Discriminative Visual Codebooks using Randomized Clustering Forests , 2006, NIPS.

[2]  Piotr Indyk,et al.  Similarity Search in High Dimensions via Hashing , 1999, VLDB.

[3]  Ali Farhadi,et al.  Describing objects by their attributes , 2009, 2009 IEEE Conference on Computer Vision and Pattern Recognition.

[4]  Lawrence D. Jackel,et al.  Handwritten Digit Recognition with a Back-Propagation Network , 1989, NIPS.

[5]  Thomas Serre,et al.  Object recognition with features inspired by visual cortex , 2005, 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'05).

[6]  Cor J. Veenman,et al.  Visual Word Ambiguity , 2010, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[7]  Yihong Gong,et al.  Locality-constrained Linear Coding for image classification , 2010, 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition.

[8]  Cordelia Schmid,et al.  Beyond Bags of Features: Spatial Pyramid Matching for Recognizing Natural Scene Categories , 2006, 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'06).

[9]  Ohad Shamir,et al.  Good learners for evil teachers , 2009, ICML '09.

[10]  Thomas Mensink,et al.  Improving the Fisher Kernel for Large-Scale Image Classification , 2010, ECCV.

[11]  Cordelia Schmid,et al.  Hamming Embedding and Weak Geometric Consistency for Large Scale Image Search , 2008, ECCV.

[12]  Yihong Gong,et al.  Linear spatial pyramid matching using sparse coding for image classification , 2009, CVPR.

[13]  Kristen Grauman,et al.  Kernelized locality-sensitive hashing for scalable image search , 2009, 2009 IEEE 12th International Conference on Computer Vision.

[14]  Pietro Perona,et al.  A Bayesian hierarchical model for learning natural scene categories , 2005, 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'05).

[15]  David Haussler,et al.  Exploiting Generative Models in Discriminative Classifiers , 1998, NIPS.

[16]  Chong-Wah Ngo,et al.  Evaluating bag-of-visual-words representations in scene classification , 2007, MIR '07.

[17]  Wen Gao,et al.  Towards semantic embedding in visual vocabulary , 2010, 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition.

[18]  Svetlana Lazebnik,et al.  Supervised Learning of Quantizer Codebooks by Information Loss Minimization , 2009, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[19]  Larry S. Davis,et al.  Image ranking and retrieval based on multi-attribute queries , 2011, CVPR 2011.

[20]  Andrew Zisserman,et al.  Learning Visual Attributes , 2007, NIPS.

[21]  Qi Tian,et al.  Task-Dependent Visual-Codebook Compression , 2012, IEEE Transactions on Image Processing.

[22]  Andrew Zisserman,et al.  Scene Classification Using a Hybrid Generative/Discriminative Approach , 2008, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[23]  Matthijs C. Dorst Distinctive Image Features from Scale-Invariant Keypoints , 2011 .

[24]  Guillermo Sapiro,et al.  Supervised Dictionary Learning , 2008, NIPS.

[25]  Michael Isard,et al.  Object retrieval with large vocabularies and fast spatial matching , 2007, 2007 IEEE Conference on Computer Vision and Pattern Recognition.

[26]  Michael Isard,et al.  Lost in quantization: Improving particular object retrieval in large scale image databases , 2008, 2008 IEEE Conference on Computer Vision and Pattern Recognition.

[27]  Andrew W. Fitzgibbon,et al.  Efficient Object Category Recognition Using Classemes , 2010, ECCV.

[28]  Hao Su,et al.  Object Bank: A High-Level Image Representation for Scene Classification & Semantic Feature Sparsification , 2010, NIPS.

[29]  Yang Yang,et al.  Learning semantic visual vocabularies using diffusion distance , 2009, CVPR.

[30]  Michael Isard,et al.  Bundling features for large scale partial-duplicate web image search , 2009, 2009 IEEE Conference on Computer Vision and Pattern Recognition.

[31]  Cordelia Schmid,et al.  Aggregating local descriptors into a compact image representation , 2010, 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition.

[32]  Antonio Torralba,et al.  Spectral Hashing , 2008, NIPS.

[33]  Wen Gao,et al.  Towards compact topical descriptors , 2012, 2012 IEEE Conference on Computer Vision and Pattern Recognition.

[34]  Gerardo Hermosillo,et al.  Supervised learning from multiple experts: whom to trust when everyone lies a bit , 2009, ICML '09.

[35]  Kenneth Rose,et al.  A generalized VQ method for combined compression and estimation , 1996, 1996 IEEE International Conference on Acoustics, Speech, and Signal Processing Conference Proceedings.

[36]  Gabriela Csurka,et al.  Adapted Vocabularies for Generic Visual Categorization , 2006, ECCV.

[37]  Antonio Criminisi,et al.  Object categorization by learned universal visual dictionary , 2005, Tenth IEEE International Conference on Computer Vision (ICCV'05) Volume 1.

[38]  Ming Yang,et al.  Large-scale image classification: Fast feature extraction and SVM training , 2011, CVPR 2011.

[39]  Yoram Singer,et al.  Efficient projections onto the l1-ball for learning in high dimensions , 2008, ICML '08.

[40]  Yang Wang,et al.  A Discriminative Latent Model of Object Classes and Attributes , 2010, ECCV.

[41]  Wen Gao,et al.  Location Discriminative Vocabulary Coding for Mobile Landmark Search , 2011, International Journal of Computer Vision.

[42]  Eli Shechtman,et al.  In defense of Nearest-Neighbor based image classification , 2008, 2008 IEEE Conference on Computer Vision and Pattern Recognition.

[43]  Andrew Zisserman,et al.  Video Google: a text retrieval approach to object matching in videos , 2003, Proceedings Ninth IEEE International Conference on Computer Vision.

[44]  David Nistér,et al.  Scalable Recognition with a Vocabulary Tree , 2006, 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'06).

[45]  Rongrong Ji,et al.  Vocabulary hierarchy optimization for effective and transferable retrieval , 2009, CVPR.

[46]  Liang-Tien Chia,et al.  Local features are not lonely – Laplacian sparse coding for image classification , 2010, 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition.