Image analysis by counting on a grid

In recent object and scene recognition research, images or large image regions are often represented as disorganized "bags" of image features. This representation allows direct application of models of word counts in text. However, image feature counts are likely to be constrained in different ways than word counts in text. As a camera pans upwards from a building entrance over its first few floors, then above the penthouse to the backdrop formed by the mountains, and then further up into the sky, some feature counts in the image drop while others rise, only to drop again, giving way to features found more often at higher elevations (Fig. 1). The space of all possible feature count combinations is constrained by the properties of the larger scene as well as by the size and location of the window into it. Accordingly, our model is based on a grid of feature counts that is considerably larger than any of the modeled images and considerably smaller than the real estate needed to tile the images tightly next to each other. Each modeled image is assumed to have a representative window in the grid in which the sum of feature counts mimics the feature distribution in the image. We provide learning procedures that jointly map all images in the training set to the counting grid and estimate the appropriate local counts in it. Experimentally, we demonstrate that the resulting representation captures the space of feature count combinations more accurately than traditional models such as latent Dirichlet allocation, even when modeling images of different scenes from the same category.
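The window idea can be illustrated with a minimal sketch. It is not the learning procedure from the paper: it assumes the grid is already given, that windows wrap around the grid edges, and that an image is matched to a window by a multinomial log-likelihood of its feature counts. The names `window_distributions` and `best_window` are hypothetical, introduced only for this example.

```python
import numpy as np

def window_distributions(grid, win):
    """Average the per-cell feature distributions inside every window.

    grid: array (H, W, Z); each cell holds a distribution over Z features.
    win:  (h, w) window size. Windows wrap around the grid edges
          (an assumption made to keep the sketch short).
    Returns an array (H, W, Z) whose entry [i, j] is the normalized
    feature distribution of the window with top-left corner (i, j).
    """
    h, w = win
    acc = np.zeros_like(grid, dtype=float)
    for di in range(h):
        for dj in range(w):
            # A negative shift moves cell (i + di, j + dj) to position (i, j).
            acc += np.roll(grid, shift=(-di, -dj), axis=(0, 1))
    return acc / (h * w)

def best_window(counts, grid, win):
    """Pick the window whose distribution best explains a bag of counts.

    counts: array (Z,) of feature counts for one image.
    The score is the multinomial log-likelihood up to a constant
    (an assumed matching criterion for this illustration).
    """
    hdist = window_distributions(grid, win)
    loglik = np.tensordot(np.log(hdist + 1e-12), counts, axes=([2], [0]))
    loc = np.unravel_index(np.argmax(loglik), loglik.shape)
    return loc, loglik
```

For example, on a grid whose upper rows favor one feature and whose lower rows favor another, an image dominated by the first feature is mapped to a window in the upper rows, which mirrors the panning-camera intuition in the abstract.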
