An attention model based on spatial transformers for scene recognition

Scene recognition is an important and challenging task in computer vision. We propose an end-to-end pipeline that combines convolutional neural networks (CNNs) with an explicit attention model to identify several meaningful regions of the original image for scene recognition. In the proposed pipeline, a spatial transformer network serves as the attention module, automatically learning the scales and center positions of the attention windows. A basic CNN architecture is used for feature extraction, and stronger scene descriptors are constructed by feature fusion. The highlight of the proposed network is that it localizes discriminative regions of an image in a data-driven manner, without any additional supervision. We conduct experiments on a subset of the Places205 database to evaluate the proposed network and its parameters. Our model achieves a state-of-the-art top-1 accuracy of 82.10% on the evaluation dataset, compared with 80.98% for a fine-tuned PlacesCNN. We find that our model learns informative attention regions for discriminating scene categories.
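The core of a spatial transformer as described above is a learned affine transform that selects an attention window from the input, followed by differentiable bilinear sampling. As a minimal sketch (a NumPy illustration of the sampling mechanics only, not the paper's trained network; function names and the 2x3 parameterization follow the original spatial transformer formulation):

```python
import numpy as np

def affine_grid(theta, H, W):
    """Map a 2x3 affine matrix theta to source sampling coordinates
    for an H x W output, using normalized coordinates in [-1, 1]."""
    ys, xs = np.meshgrid(np.linspace(-1, 1, H),
                         np.linspace(-1, 1, W), indexing="ij")
    coords = np.stack([xs.ravel(), ys.ravel(), np.ones(H * W)])  # (3, H*W)
    grid = theta @ coords                                        # (2, H*W)
    return grid.reshape(2, H, W)

def bilinear_sample(img, grid):
    """Differentiably sample a single-channel image at the grid's
    (x, y) source coordinates via bilinear interpolation."""
    H, W = img.shape
    x = (grid[0] + 1) * (W - 1) / 2   # back to pixel coordinates
    y = (grid[1] + 1) * (H - 1) / 2
    x0 = np.clip(np.floor(x).astype(int), 0, W - 2)
    y0 = np.clip(np.floor(y).astype(int), 0, H - 2)
    wx, wy = x - x0, y - y0
    return (img[y0, x0] * (1 - wx) * (1 - wy)
            + img[y0, x0 + 1] * wx * (1 - wy)
            + img[y0 + 1, x0] * (1 - wx) * wy
            + img[y0 + 1, x0 + 1] * wx * wy)

# Identity theta reproduces the input; a scale < 1 on the diagonal
# (e.g. [[0.5, 0, 0], [0, 0.5, 0]]) zooms into a central attention
# window, which is how the localization network crops regions.
img = np.arange(16, dtype=float).reshape(4, 4)
theta = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
out = bilinear_sample(img, affine_grid(theta, 4, 4))
```

In the full pipeline, a small localization CNN would regress `theta` per image, so gradients from the scene classifier flow through the sampler back into the attention parameters.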
