Visual language modeling for image classification

Although it has been studied for many years, image classification remains a challenging problem. In this paper, we propose a visual language modeling method for content-based image classification. The method transforms each image into a matrix of visual words and assumes that each visual word is conditionally dependent on its neighbors. For each image category, a visual language model is constructed from a set of training images; the model captures both the co-occurrence and the proximity information of visual words. Depending on how many neighbors are taken into consideration, three kinds of language models can be trained: unigram, bigram, and trigram, each corresponding to a different level of model complexity. Given a test image, its category is determined by estimating how likely the image is to have been generated under each category's model. Compared with traditional methods based on bag-of-words models, the proposed method can effectively exploit the spatial correlation of visual words in image classification. In addition, we propose using absent words, that is, words that appear frequently in a category but not in the target image, to aid classification. Experimental results show that our method achieves comparable accuracy while performing classification much faster.
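The bigram case described above can be illustrated with a minimal sketch. It is not the paper's implementation: it assumes each image is already a grid of integer visual-word ids, that the single neighbor conditioned on is the word to the left, and that add-one smoothing handles unseen pairs. The names `train_bigram`, `log_likelihood`, and `classify`, and the toy categories, are hypothetical.

```python
import math
from collections import Counter

def train_bigram(images, vocab_size):
    """Count (left neighbor, word) pairs over one category's training grids."""
    pair_counts = Counter()
    context_counts = Counter()
    for grid in images:
        for row in grid:
            for left, word in zip(row, row[1:]):
                pair_counts[(left, word)] += 1
                context_counts[left] += 1
    return pair_counts, context_counts, vocab_size

def log_likelihood(grid, model):
    """Log-probability of a grid under a category model.

    Add-one smoothing is an assumption made for this sketch.
    """
    pair_counts, context_counts, vocab_size = model
    total = 0.0
    for row in grid:
        for left, word in zip(row, row[1:]):
            p = (pair_counts[(left, word)] + 1) / (context_counts[left] + vocab_size)
            total += math.log(p)
    return total

def classify(grid, models):
    """Return the category whose model gives the grid the highest likelihood."""
    return max(models, key=lambda category: log_likelihood(grid, models[category]))

# Toy data: both categories use words 0 and 1 equally often, so word
# frequencies alone cannot separate them; only the spatial arrangement differs.
stripes_train = [
    [[0, 1, 0, 1], [1, 0, 1, 0]],
    [[0, 1, 0, 1], [0, 1, 0, 1]],
]
flat_train = [
    [[0, 0, 0, 0], [1, 1, 1, 1]],
    [[1, 1, 1, 1], [0, 0, 0, 0]],
]
models = {
    "stripes": train_bigram(stripes_train, vocab_size=2),
    "flat": train_bigram(flat_train, vocab_size=2),
}

print(classify([[1, 0, 1, 0], [0, 1, 0, 1]], models))
print(classify([[1, 1, 1, 1], [1, 1, 1, 1]], models))
```

In this toy setup a bag-of-words histogram is identical for the two training sets, so the example isolates the contribution of neighbor dependence. A trigram variant would condition on two neighbors instead of one, at the cost of sparser counts.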
