A Language Modeling Approach to Image Classification

Driven by the rapid spread of digital devices and media (digital cameras, camera phones, the internet), the number and size of image databases are increasing dramatically. Managing such databases is an important issue, both for professional databases (e.g., those of photo agencies) and for personal collections. Image classification and retrieval are therefore becoming increasingly challenging, and they require discriminative image descriptors and robust classifiers. Current approaches generally describe an image as a set of elementary, independent image patches called visual words, then apply a classical classifier such as a Support Vector Machine (SVM). In this paper, we propose a more precise description of images, called visual sentences, that includes simple spatial information between visual words. We then propose a classification technique based on language modeling, which can exploit the spatial information of the visual sentences. Experiments on two classical datasets show that our classification method clearly outperforms the state-of-the-art SVM classifier.
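To make the idea of classifying with language models concrete, the following is a minimal sketch, not the method of the paper. It assumes that each image has already been turned into a visual sentence, i.e. an ordered list of integer visual-word IDs, and it assumes a bigram model with add-one smoothing and uniform class priors; the abstract specifies neither the n-gram order nor the smoothing scheme. The names `BigramClassModel` and `classify` and the toy labels are hypothetical.

```python
import math
from collections import Counter


class BigramClassModel:
    """Bigram model over visual-word IDs (assumed: add-one smoothing)."""

    def __init__(self, vocab_size):
        self.vocab_size = vocab_size
        self.bigrams = Counter()   # counts of (previous word, current word)
        self.contexts = Counter()  # counts of the previous word alone

    def fit(self, sentences):
        # Accumulate bigram counts from the visual sentences of one class.
        for sentence in sentences:
            for prev, cur in zip(sentence, sentence[1:]):
                self.bigrams[(prev, cur)] += 1
                self.contexts[prev] += 1

    def log_prob(self, sentence):
        # Sum of smoothed log P(cur | prev) over consecutive word pairs.
        total = 0.0
        for prev, cur in zip(sentence, sentence[1:]):
            num = self.bigrams[(prev, cur)] + 1
            den = self.contexts[prev] + self.vocab_size
            total += math.log(num / den)
        return total


def classify(sentence, models):
    # Pick the class whose model gives the sentence the highest likelihood
    # (uniform class priors are assumed here).
    return max(models, key=lambda label: models[label].log_prob(sentence))


# Toy, made-up training data: three visual words (0, 1, 2), two classes.
train = {
    "beach": [[0, 1, 2, 0, 1, 2], [0, 1, 2, 2]],
    "city": [[2, 1, 0, 2, 1, 0], [2, 1, 0, 0]],
}

models = {}
for label, sentences in train.items():
    model = BigramClassModel(vocab_size=3)
    model.fit(sentences)
    models[label] = model

predicted = classify([0, 1, 2, 0], models)
print(predicted)
```

Because the model scores ordered pairs of words, two images containing the same visual words in a different spatial order receive different likelihoods, which is the kind of information a bag-of-visual-words representation discards.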
