Indoor Scene Recognition from RGB-D Images by Learning Scene Bases

In this paper, we propose an RGB-D indoor scene recognition method that has two main advantages over existing methods. First, by training object detectors on RGB-D images and recognizing the spatial interrelationships of the detected objects, we not only achieve better object localization accuracy than with RGB images alone, but also capture how the objects are spatially related to one another. This yields a more effective high-level feature representation of the scene, known as the Objects and Attributes (O&A) representation. Second, we learn class-specific sub-dictionaries that capture the high-order couplings between the objects and attributes. In particular, elastic net regularization and a geometric similarity constraint are imposed to increase the discriminative power of the sub-dictionaries. The proposed method is evaluated on two RGB-D datasets, NYUD and B3DO. Experiments show that our method achieves a superior scene recognition rate.
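
To make the role of class-specific sub-dictionaries and elastic net regularization more concrete, here is a minimal sketch and not the paper's actual formulation. It assumes each scene class has a small sub-dictionary of atoms over a hypothetical four-dimensional O&A feature vector, codes a test vector against each sub-dictionary by coordinate descent on 0.5*||x - D a||^2 + lam1*||a||_1 + 0.5*lam2*||a||^2, and picks the class with the smallest reconstruction error. The residual-based decision rule, the toy feature layout, the values of `lam1` and `lam2`, and every function name are assumptions made for illustration. The geometric similarity constraint and the dictionary learning step itself are omitted.

```python
def soft_threshold(value, threshold):
    """Proximal operator of the L1 penalty."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def reconstruct(atoms, code):
    """Return the weighted sum of atoms, i.e. D @ a."""
    dim = len(atoms[0])
    return [sum(c * atom[i] for c, atom in zip(code, atoms)) for i in range(dim)]


def elastic_net_code(x, atoms, lam1=0.05, lam2=0.05, n_iters=100):
    """Coordinate descent for the elastic net coding objective."""
    code = [0.0] * len(atoms)
    for _ in range(n_iters):
        for j, atom in enumerate(atoms):
            approx = reconstruct(atoms, code)
            # Residual with atom j's current contribution added back in.
            partial = [xi - ai + code[j] * dj
                       for xi, ai, dj in zip(x, approx, atom)]
            rho = dot(atom, partial)
            code[j] = soft_threshold(rho, lam1) / (dot(atom, atom) + lam2)
    return code


def squared_residual(x, atoms, code):
    approx = reconstruct(atoms, code)
    return sum((xi - ai) ** 2 for xi, ai in zip(x, approx))


def classify(x, sub_dictionaries, lam1=0.05, lam2=0.05):
    """Assumed decision rule: smallest reconstruction error wins."""
    errors = {}
    for label, atoms in sub_dictionaries.items():
        code = elastic_net_code(x, atoms, lam1, lam2)
        errors[label] = squared_residual(x, atoms, code)
    return min(errors, key=errors.get), errors


# Hypothetical feature layout: [bed, pillow, stove, sink] detector scores.
sub_dictionaries = {
    "bedroom": [[1.0, 1.0, 0.0, 0.0], [1.0, 0.0, 0.0, 0.0]],
    "kitchen": [[0.0, 0.0, 1.0, 1.0], [0.0, 0.0, 1.0, 0.0]],
}
x = [0.9, 0.8, 0.0, 0.1]
label, errors = classify(x, sub_dictionaries)
print(label)
```

In this toy setup the bedroom atoms can reproduce the strong bed and pillow responses while the kitchen atoms cannot, so the bedroom sub-dictionary gives the lower residual. The L2 term keeps the per-coordinate update well conditioned when atoms are correlated, which is the usual motivation for elastic net over a pure L1 penalty.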
