Indoor Scene Recognition using Task and Saliency-driven Feature Pooling

Indoor scenes are characterized by high intra-class variability, mainly due to the intrinsic variety of the objects they contain and to the drastic image variations caused by even small viewpoint changes. One of the main trends in the literature has been to employ representations that couple statistical characterizations of the image with a description of their spatial distribution. This is usually done by combining multiple representations of different image regions, most often using a fixed 4x4 or pyramidal image-partitioning scheme. While these encodings capture the spatial regularities of the problem, they are poorly suited to handling its spatial variability. In this work we propose to complement a traditional spatial-encoding scheme with a bottom-up approach designed to discover visual structures regardless of their exact position in the scene. To this end we use saliency maps to segment each image into two regions: the most salient 50% and the least salient 50%. This segmentation provides a description of the image that is loosely related to the relative semantics of the discovered regions, and it complements the canonical spatial description. We evaluated the proposed technique on three public scene recognition datasets. Our results show this approach to be effective in the indoor scenario, while also being meaningful for other scene categorization tasks.
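The saliency-driven pooling described above can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes that local features have already been quantized into visual-word indices (a hard-assignment bag-of-words, which the abstract does not specify), that a saliency map of the same size as the image is available from any saliency model, and that the 50% split is obtained by thresholding at the median saliency value. The function name `saliency_split_histograms` and all of its parameters are hypothetical.

```python
import numpy as np

def saliency_split_histograms(saliency, xs, ys, words, vocab_size):
    """Pool visual-word counts separately over the most and least salient halves.

    saliency   : (H, W) array, one saliency value per pixel (assumed given).
    xs, ys     : integer arrays with the pixel coordinates of each local feature.
    words      : integer array with the visual-word index of each local feature.
    vocab_size : number of visual words in the (assumed) codebook.
    """
    # Assumption: the median saliency splits the image into two equal-area regions.
    threshold = np.median(saliency)
    in_salient_half = saliency[ys, xs] > threshold

    # One histogram per region, so each feature is pooled by saliency, not position.
    hist_hi = np.bincount(words[in_salient_half], minlength=vocab_size).astype(float)
    hist_lo = np.bincount(words[~in_salient_half], minlength=vocab_size).astype(float)

    # L1-normalize each histogram, skipping empty regions to avoid division by zero.
    for hist in (hist_hi, hist_lo):
        total = hist.sum()
        if total > 0:
            hist /= total

    # The concatenated vector could be appended to a spatial-grid descriptor.
    return np.concatenate([hist_hi, hist_lo])


# Toy usage: a 4x4 saliency map and four features on the diagonal.
saliency = np.arange(16, dtype=float).reshape(4, 4)
xs = np.array([0, 1, 2, 3])
ys = np.array([0, 1, 2, 3])
words = np.array([0, 1, 1, 2])
descriptor = saliency_split_histograms(saliency, xs, ys, words, vocab_size=3)
```

Because the two histograms are indexed by saliency rank rather than by image coordinates, the resulting vector does not change when a salient structure moves within the frame, which is the property the abstract contrasts with fixed grid or pyramid partitions.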
