Towards semantic embedding in visual vocabulary

Visual vocabulary serves as a fundamental component in many computer vision tasks, such as object recognition, visual search, and scene modeling. While state-of-the-art approaches build visual vocabulary based solely on visual statistics of local image patches, the correlative image labels are left unexploited in generating visual words. In this work, we present a semantic embedding framework to integrate semantic information from Flickr labels for supervised vocabulary construction. Our main contribution is a Hidden Markov Random Field modeling to supervise feature space quantization, with specialized considerations to label correlations: Local visual features are modeled as an Observed Field, which follows visual metrics to partition feature space. Semantic labels are modeled as a Hidden Field, which imposes generative supervision to the Observed Field with WordNet-based correlation constraints as Gibbs distribution. By simplifying the Markov property in the Hidden Field, both unsupervised and supervised (label independent) vocabularies can be derived from our framework. We validate our performances in two challenging computer vision tasks with comparisons to state-of-the-arts: (1) Large-scale image search on a Flickr 60,000 database; (2) Object recognition on the PASCAL VOC database.

[1]  Antonio Criminisi,et al.  Object categorization by learned universal visual dictionary , 2005, Tenth IEEE International Conference on Computer Vision (ICCV'05) Volume 1.

[2]  Thomas Serre,et al.  A Theory of Object Recognition: Computations and Circuits in the Feedforward Path of the Ventral Stream in Primate Visual Cortex , 2005 .

[3]  Cordelia Schmid,et al.  Local Features and Kernels for Classification of Texture and Object Categories: A Comprehensive Study , 2006, CVPR Workshops.

[4]  Teuvo Kohonen,et al.  Self-Organizing Maps , 2010 .

[5]  Raymond J. Mooney,et al.  A probabilistic framework for semi-supervised clustering , 2004, KDD.

[6]  Christiane Fellbaum,et al.  Book Reviews: WordNet: An Electronic Lexical Database , 1999, CL.

[7]  Yang Yang,et al.  Learning semantic visual vocabularies using diffusion distance , 2009, CVPR.

[8]  Maosong Sun,et al.  Automatic Image Annotation Based on WordNet and Hierarchical Ensembles , 2006, CICLing.

[9]  Dan Roth,et al.  Learning a Sparse Representation for Object Detection , 2002, ECCV.

[10]  Svetlana Lazebnik,et al.  Supervised Learning of Quantizer Codebooks by Information Loss Minimization , 2009, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[11]  David Nistér,et al.  Scalable Recognition with a Vocabulary Tree , 2006, 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'06).

[12]  Frédéric Jurie,et al.  Fast Discriminative Visual Codebooks using Randomized Clustering Forests , 2006, NIPS.

[13]  Donald Geman,et al.  Stochastic Relaxation, Gibbs Distributions, and the Bayesian Restoration of Images , 1984, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[14]  Andrew Zisserman,et al.  Scene Classification Using a Hybrid Generative/Discriminative Approach , 2008, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[15]  Michael I. Jordan,et al.  Learning Spectral Clustering , 2003, NIPS.

[16]  David A. Forsyth,et al.  Object Recognition as Machine Translation: Learning a Lexicon for a Fixed Image Vocabulary , 2002, ECCV.

[17]  Pietro Perona,et al.  A Bayesian hierarchical model for learning natural scene categories , 2005, 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'05).

[18]  Kenneth Rose,et al.  A generalized VQ method for combined compression and estimation , 1996, 1996 IEEE International Conference on Acoustics, Speech, and Signal Processing Conference Proceedings.

[19]  J. M. Hammersley,et al.  Markov fields on finite graphs and lattices , 1971 .

[20]  Richard Szeliski,et al.  City-Scale Location Recognition , 2007, 2007 IEEE Conference on Computer Vision and Pattern Recognition.

[21]  B. Schiele,et al.  Combined Object Categorization and Segmentation With an Implicit Shape Model , 2004 .

[22]  Ted Pedersen,et al.  WordNet::Similarity - Measuring the Relatedness of Concepts , 2004, NAACL.

[23]  Matthijs C. Dorst Distinctive Image Features from Scale-Invariant Keypoints , 2011 .

[24]  Gabriela Csurka,et al.  Adapted Vocabularies for Generic Visual Categorization , 2006, ECCV.

[25]  Cordelia Schmid,et al.  Hamming Embedding and Weak Geometric Consistency for Large Scale Image Search , 2008, ECCV.

[26]  Cordelia Schmid,et al.  Local Features and Kernels for Classification of Texture and Object Categories: A Comprehensive Study , 2006, 2006 Conference on Computer Vision and Pattern Recognition Workshop (CVPRW'06).

[27]  Michael Isard,et al.  Object retrieval with large vocabularies and fast spatial matching , 2007, 2007 IEEE Conference on Computer Vision and Pattern Recognition.

[28]  Guillermo Sapiro,et al.  Supervised Dictionary Learning , 2008, NIPS.

[29]  Cordelia Schmid,et al.  Semantic Hierarchies for Visual Object Recognition , 2007, 2007 IEEE Conference on Computer Vision and Pattern Recognition.

[30]  Rongrong Ji,et al.  Vocabulary hierarchy optimization for effective and transferable retrieval , 2009, CVPR.