Capturing Spatial Interdependence in Image Features: The Counting Grid, an Epitomic Representation for Bags of Features

In recent scene recognition research, images or large image regions are often represented as disorganized "bags" of features, which can then be analyzed using models originally developed to capture co-variation of word counts in text. However, image feature counts are likely to be constrained in different ways than word counts in text. For example, as a camera pans upwards from a building entrance over its first few floors and then further up into the sky (Fig. 1), some feature counts in the image drop while others rise, only to drop again and give way to features found more often at higher elevations. The space of all possible feature count combinations is constrained both by the properties of the larger scene and by the size and location of the window into it. To capture such variation, in this paper we propose the use of the counting grid model. This generative model is based on a grid of feature counts that is considerably larger than any of the modeled images, yet considerably smaller than the area needed to tile all the images tightly next to each other. Each modeled image is assumed to have a representative window in the grid in which the feature counts mimic the feature distribution in the image. We provide a learning procedure that jointly maps all images in the training set to the counting grid and estimates the appropriate local counts in it. Experimentally, we demonstrate that the resulting representation captures the space of feature count combinations more accurately than traditional models, not only when the input images come from a panning camera, but also when modeling images of different scenes from the same category.
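The idea of mapping an image to a representative window can be illustrated with a minimal NumPy sketch. It is not the paper's implementation: it assumes each grid cell holds a normalized distribution over features, that a window's histogram is the average of the cells it covers, that windows wrap around the grid edges, and that a bag of counts is scored with a multinomial log-likelihood. The names `window_histograms` and `map_to_grid` are hypothetical, and the learning step that re-estimates the grid is omitted.

```python
import numpy as np


def window_histograms(pi, win):
    """Average the per-cell feature distributions over every window position.

    pi  : array of shape (E1, E2, Z); pi[i, j] is a distribution over Z features.
    win : (W1, W2) window size, assumed smaller than the grid.
    Returns h of shape (E1, E2, Z), where h[i, j] is the mean of pi over the
    window whose top-left corner is (i, j). Windows wrap around the grid
    edges, which is an assumption of this sketch.
    """
    w1, w2 = win
    h = np.zeros_like(pi, dtype=float)
    for di in range(w1):
        for dj in range(w2):
            # A negative shift brings cell (i + di, j + dj) to position (i, j).
            h += np.roll(pi, shift=(-di, -dj), axis=(0, 1))
    return h / (w1 * w2)


def map_to_grid(counts, pi, win, eps=1e-12):
    """Find the window position whose histogram best explains a bag of counts.

    counts : array of shape (Z,) with the feature counts of one image.
    Returns (best_position, loglik), where loglik[i, j] is the multinomial
    log-likelihood of the counts under window (i, j), up to a constant that
    does not depend on the position.
    """
    h = window_histograms(pi, win)
    loglik = np.tensordot(np.log(h + eps), counts, axes=([2], [0]))
    best = np.unravel_index(np.argmax(loglik), loglik.shape)
    return best, loglik


if __name__ == "__main__":
    # Toy grid: the left half favors feature 0, the right half feature 1.
    grid = np.empty((8, 8, 2))
    grid[:, :4] = [0.9, 0.1]
    grid[:, 4:] = [0.1, 0.9]
    position, _ = map_to_grid(np.array([5.0, 5.0]), grid, win=(2, 2))
    print("best window (row, col):", position)
```

In this toy grid, a bag with balanced counts is best explained by a window that straddles the boundary between the two halves, which mirrors the panning-camera intuition: the window position determines which mixture of features is expected.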
