A Thousand Words in a Scene

This paper presents a novel approach for visual scene modeling and classification, investigating the combined use of text modeling methods and local invariant features. Our work attempts to elucidate (1) whether a textlike bag-of-visterms (BOV) representation (histogram of quantized local visual features) is suitable for scene (rather than object) classification, (2) whether some analogies between discrete scene representations and text documents exist, and 3) whether unsupervised, latent space models can be used both as feature extractors for the classification task and to discover patterns of visual co-occurrence. Using several data sets, we validate our approach, presenting and discussing experiments on each of these issues. We first show, with extensive experiments on binary and multiclass scene classification tasks using a 9,500-image data set, that the BOV representation consistently outperforms classical scene classification approaches. In other data sets, we show that our approach competes with or outperforms other recent more complex methods. We also show that probabilistic latent semantic analysis (PLSA) generates a compact scene representation, is discriminative for accurate classification, and is more robust than the BOV representation when less labeled training data is available. Finally, through aspect-based image ranking experiments, we show the ability of PLSA to automatically extract visually meaningful scene patterns, making such representation useful for browsing image collections.

[1]  G. G. Stokes "J." , 1890, The New Yale Book of Quotations.

[2]  Michael I. Jordan,et al.  Latent Dirichlet Allocation , 2001, J. Mach. Learn. Res..

[3]  Alexei A. Efros,et al.  Discovering object categories in image collections , 2005 .

[4]  Bernt Schiele,et al.  Natural Scene Retrieval Based on a Semantic Modeling Step , 2004, CIVR.

[5]  Andrew Zisserman,et al.  Video Google: a text retrieval approach to object matching in videos , 2003, Proceedings Ninth IEEE International Conference on Computer Vision.

[6]  Peter Auer,et al.  Weak Hypotheses and Boosting for Generic Object Detection and Recognition , 2004, ECCV.

[7]  Luc Van Gool,et al.  Modeling scenes with local descriptors and latent aspects , 2005, Tenth IEEE International Conference on Computer Vision (ICCV'05) Volume 1.

[8]  Andrew Zisserman,et al.  Multi-view Matching for Unordered Image Sets, or "How Do I Organize My Holiday Snaps?" , 2002, ECCV.

[9]  Rosalind W. Picard,et al.  Texture orientation for sorting photos "at a glance" , 1994, Proceedings of 12th International Conference on Pattern Recognition.

[10]  Pietro Perona,et al.  A Bayesian hierarchical model for learning natural scene categories , 2005, 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'05).

[11]  Joo-Hwee Lim,et al.  Semantics Discovery for Image Indexing , 2004, ECCV.

[12]  Marcel Worring,et al.  Content-Based Image Retrieval at the End of the Early Years , 2000, IEEE Trans. Pattern Anal. Mach. Intell..

[13]  Luc Van Gool,et al.  Content-Based Image Retrieval Based on Local Affinely Invariant Regions , 1999, VISUAL.

[14]  Andrew Zisserman,et al.  Video data mining using configurations of viewpoint invariant regions , 2004, CVPR 2004.

[15]  Anil K. Jain,et al.  Image classification for content-based indexing , 2001, IEEE Trans. Image Process..

[16]  Cordelia Schmid,et al.  A Comparison of Affine Region Detectors , 2005, International Journal of Computer Vision.

[17]  Jason Weston,et al.  Multi-Class Support Vector Machines , 1998 .

[18]  Martial Hebert,et al.  Man-made structure detection in natural images using a causal multiscale random field , 2003, 2003 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2003. Proceedings..

[19]  G LoweDavid,et al.  Distinctive Image Features from Scale-Invariant Keypoints , 2004 .

[20]  Shih-Fu Chang,et al.  A knowledge engineering approach for image classification based on probabilistic reasoning systems , 2000, 2000 IEEE International Conference on Multimedia and Expo. ICME2000. Proceedings. Latest Advances in the Fast Changing World of Multimedia (Cat. No.00TH8532).

[21]  Luc Van Gool,et al.  Fast indexing for image retrieval based on local appearance with re-ranking , 2003, Proceedings 2003 International Conference on Image Processing (Cat. No.03CH37429).

[22]  Cordelia Schmid,et al.  Selection of scale-invariant parts for object class recognition , 2003, Proceedings Ninth IEEE International Conference on Computer Vision.

[23]  David G. Lowe,et al.  Distinctive Image Features from Scale-Invariant Keypoints , 2004, International Journal of Computer Vision.

[24]  Jiebo Luo,et al.  Learning multi-label scene classification , 2004, Pattern Recognit..

[25]  Nozha Boujemaa,et al.  New image retrieval paradigm: logical composition of region categories , 2003, Proceedings 2003 International Conference on Image Processing (Cat. No.03CH37429).

[26]  David A. Forsyth,et al.  Matching Words and Pictures , 2003, J. Mach. Learn. Res..

[27]  Antonio Torralba,et al.  Modeling the Shape of the Scene: A Holistic Representation of the Spatial Envelope , 2001, International Journal of Computer Vision.

[28]  Paul Over,et al.  The TREC-2002 Video Track Report , 2002, TREC.

[29]  Lixin Fan,et al.  Categorizing Nine Visual Classes using Local Appearance Descriptors , 2004 .

[30]  Jean-Marc Odobez,et al.  Constructing Visual Models with a Latent Space Approach , 2005, SLSFS.

[31]  Daniel Gatica-Perez,et al.  On image auto-annotation with latent space models , 2003, ACM Multimedia.

[32]  Cordelia Schmid,et al.  A performance evaluation of local descriptors , 2005, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[33]  Pietro Perona,et al.  A Bayesian approach to unsupervised one-shot learning of object categories , 2003, Proceedings Ninth IEEE International Conference on Computer Vision.

[34]  B. Schiele,et al.  Interleaved Object Categorization and Segmentation , 2003, BMVC.

[35]  Martial Hebert,et al.  Discriminative random fields: a discriminative framework for contextual interaction in classification , 2003, Proceedings Ninth IEEE International Conference on Computer Vision.

[36]  Alexei A. Efros,et al.  Discovering objects and their location in images , 2005, Tenth IEEE International Conference on Computer Vision (ICCV'05) Volume 1.

[37]  Martin Szummer,et al.  Indoor-outdoor image classification , 1998, Proceedings 1998 IEEE International Workshop on Content-Based Access of Image and Video Database.

[38]  Daniel Gatica-Perez,et al.  PLSA-based image auto-annotation: constraining the latent space , 2004, MULTIMEDIA '04.

[39]  Pietro Perona,et al.  Learning Generative Visual Models from Few Training Examples: An Incremental Bayesian Approach Tested on 101 Object Categories , 2004, 2004 Conference on Computer Vision and Pattern Recognition Workshop.

[40]  Thomas Hofmann,et al.  Unsupervised Learning by Probabilistic Latent Semantic Analysis , 2004, Machine Learning.

[41]  Andrew Zisserman,et al.  Video data mining using configurations of viewpoint invariant regions , 2004, Proceedings of the 2004 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2004. CVPR 2004..

[42]  Zhongfei Zhang,et al.  Hidden semantic concept discovery in region based image retrieval , 2004, Proceedings of the 2004 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2004. CVPR 2004..

[43]  Antonio Torralba,et al.  Context-based vision system for place and object recognition , 2003, Proceedings Ninth IEEE International Conference on Computer Vision.

[44]  Michael I. Jordan,et al.  Modeling annotated data , 2003, SIGIR.

[45]  HofmannThomas Unsupervised Learning by Probabilistic Latent Semantic Analysis , 2001 .

[46]  Pietro Perona,et al.  Object class recognition by unsupervised scale-invariant learning , 2003, 2003 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2003. Proceedings..

[47]  Milind R. Naphade,et al.  A probabilistic framework for semantic video indexing, filtering, and retrieval , 2001, IEEE Trans. Multim..

[48]  Jiebo Luo,et al.  A computationally efficient approach to indoor/outdoor scene classification , 2002, Object recognition supported by user interaction for service robots.

[49]  Pietro Perona,et al.  A Bayesian approach to unsupervised one-shot learning of object categories , 2003, ICCV 2003.

[50]  SchmidCordelia,et al.  A Performance Evaluation of Local Descriptors , 2005 .