Fully Convolutional Attention Localization Networks: Efficient Attention Localization for Fine-Grained Recognition

Fine-grained recognition is challenging mainly because the inter-class differences between fine-grained classes are usually local and subtle while intra-class differences could be large due to pose variations. In order to distinguish them from intra-class variations, it is essential to zoom in on highly discriminative local regions. In this work, we introduce a reinforcement learning-based fully convolutional attention localization network to adaptively select multiple task-driven visual attention regions. We show that zooming in on the selected attention regions significantly improves the performance of fine-grained recognition. Compared to previous reinforcement learning-based models, the proposed approach is noticeably more computationally efficient during both training and testing because of its fully-convolutional architecture, and it is capable of simultaneous focusing its glimpse on multiple visual attention regions. The experiments demonstrate that the proposed method achieves notably higher classification accuracy on three benchmark fine-grained recognition datasets: Stanford Dogs, Stanford Cars, and CUB-200-2011.

[1]  Jonathan Krause,et al.  The Unreasonable Effectiveness of Noisy Data for Fine-Grained Recognition , 2015, ECCV.

[2]  Ross B. Girshick,et al.  Fast R-CNN , 2015, 1504.08083.

[3]  Pierre Sermanet,et al.  Attention for Fine-Grained Categorization , 2014, ICLR.

[4]  Ya Zhang,et al.  Augmenting Strong Supervision Using Web Data for Fine-Grained Categorization , 2015, 2015 IEEE International Conference on Computer Vision (ICCV).

[5]  Peter N. Belhumeur,et al.  POOF: Part-Based One-vs.-One Features for Fine-Grained Categorization, Face Verification, and Attribute Estimation , 2013, 2013 IEEE Conference on Computer Vision and Pattern Recognition.

[6]  Jonathan Krause,et al.  3D Object Representations for Fine-Grained Categorization , 2013, 2013 IEEE International Conference on Computer Vision Workshops.

[7]  Marcel Simon,et al.  Neural Activation Constellations: Unsupervised Part Model Discovery with Convolutional Networks , 2015, 2015 IEEE International Conference on Computer Vision (ICCV).

[8]  Rong Jin,et al.  Fine-grained visual categorization via multi-stage metric learning , 2014, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[9]  Alex Graves,et al.  Recurrent Models of Visual Attention , 2014, NIPS.

[10]  Andrew Zisserman,et al.  Spatial Transformer Networks , 2015, NIPS.

[11]  Andrew Zisserman,et al.  Very Deep Convolutional Networks for Large-Scale Image Recognition , 2014, ICLR.

[12]  Arnold W. M. Smeulders,et al.  Fine-Grained Categorization by Alignments , 2013, 2013 IEEE International Conference on Computer Vision.

[13]  Trevor Darrell,et al.  Part-Based R-CNNs for Fine-Grained Category Detection , 2014, ECCV.

[14]  Subhransu Maji,et al.  Bilinear CNN Models for Fine-Grained Visual Recognition , 2015, 2015 IEEE International Conference on Computer Vision (ICCV).

[15]  Fei-Fei Li,et al.  Novel Dataset for Fine-Grained Image Categorization : Stanford Dogs , 2012 .

[16]  David W. Jacobs,et al.  Dog Breed Classification Using Part Localization , 2012, ECCV.

[17]  Dumitru Erhan,et al.  Going deeper with convolutions , 2014, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[18]  Pietro Perona,et al.  Bird Species Categorization Using Pose Normalized Deep Convolutional Nets , 2014, ArXiv.

[19]  Pietro Perona,et al.  The Caltech-UCSD Birds-200-2011 Dataset , 2011 .

[20]  Fei-Fei Li,et al.  ImageNet: A large-scale hierarchical image database , 2009, 2009 IEEE Conference on Computer Vision and Pattern Recognition.

[21]  Tianbao Yang,et al.  Object-centric Sampling for Fine-grained Image Classification , 2014, ArXiv.

[22]  Koray Kavukcuoglu,et al.  Visual Attention , 2020, Computational Models for Cognitive Vision.

[23]  Andrew Zisserman,et al.  Symbiotic Segmentation and Part Localization for Fine-Grained Categorization , 2013, 2013 IEEE International Conference on Computer Vision.

[24]  Jonathan Krause,et al.  Fine-grained recognition without part annotations , 2015, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[25]  Anton van den Hengel,et al.  The treasure beneath convolutional layers: Cross-convolutional-layer pooling for image classification , 2014, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[26]  Andrew Zisserman,et al.  BiCoS: A Bi-level co-segmentation method for image classification , 2011, 2011 International Conference on Computer Vision.

[27]  Linda G. Shapiro,et al.  Unsupervised Template Learning for Fine-Grained Object Recognition , 2012, NIPS.

[28]  R. J. Williams,et al.  Simple Statistical Gradient-Following Algorithms for Connectionist Reinforcement Learning , 2004, Machine Learning.

[29]  Jianfei Cai,et al.  Weakly Supervised Fine-Grained Image Categorization , 2015, ArXiv.