RAMS-Trans: Recurrent Attention Multi-scale Transformer for Fine-grained Image Recognition

In fine-grained image recognition (FGIR), localizing and amplifying region attention is an important factor, and it has been explored extensively by approaches based on convolutional neural networks (CNNs). The recently developed vision transformer (ViT) has achieved promising results in computer vision tasks; compared with CNNs, its image sequentialization is a brand-new paradigm. However, because of its fixed patch size, ViT is limited in its receptive field, lacks the local attention of CNNs, and cannot generate multi-scale features to learn discriminative region attention. To facilitate learning discriminative region attention without box/part annotations, we use the strength of the attention weights to measure the importance of the patch tokens corresponding to the raw image. We propose the recurrent attention multi-scale transformer (RAMS-Trans), which uses the transformer's self-attention to recursively learn discriminative region attention in a multi-scale manner. At the core of our approach lies the dynamic patch proposal module (DPPM), which guides region amplification to integrate multi-scale image patches. The DPPM starts from the full-size image patches and iteratively scales up the attended region to generate new patches from global to local, using the intensity of the attention weights produced at each scale as the indicator. Our approach requires only the attention weights that come with ViT itself and can be easily trained end-to-end. Extensive experiments demonstrate that RAMS-Trans outperforms existing works, in addition to efficient CNN models, achieving state-of-the-art results on three benchmark datasets.
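The abstract describes the DPPM only at a high level: patch-level attention intensities indicate which region to amplify at the next scale. As an illustration only, the following is a minimal sketch of one plausible proposal step; the function name `propose_region` and the `threshold_scale` parameter are hypothetical, not the paper's actual interface, and the thresholding-by-mean rule is an assumption about how "intensity of the attention weights as an indicator" might be operationalized.

```python
import numpy as np

def propose_region(attn_map, image_size, threshold_scale=1.0):
    """Hypothetical sketch of an attention-guided region proposal step.

    attn_map: (H, W) grid of patch-level attention intensities, e.g. the
    CLS-to-patch attention of the last ViT layer averaged over heads and
    reshaped to the patch grid.
    Returns (top, left, bottom, right) pixel coordinates of the region to
    crop, upsample, and re-encode at the next scale.
    """
    H, W = attn_map.shape
    # Keep patches whose attention exceeds a multiple of the mean intensity;
    # the maximum always passes, so the mask is never empty.
    mask = attn_map >= threshold_scale * attn_map.mean()
    rows, cols = np.where(mask)
    # Map patch-grid indices back to pixel coordinates.
    ph, pw = image_size / H, image_size / W
    top, bottom = rows.min() * ph, (rows.max() + 1) * ph
    left, right = cols.min() * pw, (cols.max() + 1) * pw
    return int(top), int(left), int(bottom), int(right)
```

Applied recursively (full image, then the proposed crop, and so on), this yields the global-to-local sequence of scales the abstract describes, using nothing beyond the attention weights ViT already computes.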
