Weakly Supervised Learning of Object-Part Attention Model for Fine-Grained Image Classification

Fine-grained classification is challengeable due to the small inter-class variance and large intra-class distance between fine-grained categories. The key to solve this problem is to locate the discriminative part in the image. In this paper we propose a weakly supervised method, which only need image-level label for fine-grained classification. In our model, the convolutional neural network (CNN) can location the discriminative region through attention, and automatically focus on subtler features by zooming the discriminative region and feeding it to the next CNN. A Squeeze and Excitation (SE) module is employed for channel-wise attention, and a spatial constrain loss is utilized to keep the diversity of located part. We conduct experiments on CUB-2011-200, Stanford Dogs, and Stanford Cars datasets to evaluate the performance of our model. The experimental results demonstrate the effectiveness of the proposed method as compared other methods.

[1]  Subhransu Maji,et al.  Fine-Grained Visual Classification of Aircraft , 2013, ArXiv.

[2]  Jonathan Krause,et al.  3D Object Representations for Fine-Grained Categorization , 2013, 2013 IEEE International Conference on Computer Vision Workshops.

[3]  Pietro Perona,et al.  Bird Species Categorization Using Pose Normalized Deep Convolutional Nets , 2014, ArXiv.

[4]  Xiao Liu,et al.  Fully Convolutional Attention Localization Networks: Efficient Attention Localization for Fine-Grained Recognition , 2016, ArXiv.

[5]  Yuxin Peng,et al.  The application of two-level attention models in deep convolutional neural network for fine-grained image classification , 2014, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[6]  Fei-Fei Li,et al.  Novel Dataset for Fine-Grained Image Categorization : Stanford Dogs , 2012 .

[7]  Pietro Perona,et al.  The Caltech-UCSD Birds-200-2011 Dataset , 2011 .

[8]  Subhransu Maji,et al.  Bilinear Convolutional Neural Networks for Fine-Grained Visual Recognition , 2018, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[9]  Ahmed M. Elgammal,et al.  SPDA-CNN: Unifying Semantic Part Detection and Abstraction for Fine-Grained Recognition , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[10]  Xiao Liu,et al.  Kernel Pooling for Convolutional Neural Networks , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[11]  Zhiqiang Shen,et al.  Multiple Granularity Descriptors for Fine-Grained Categorization , 2015, 2015 IEEE International Conference on Computer Vision (ICCV).

[12]  Ya Zhang,et al.  Part-Stacked CNN for Fine-Grained Visual Categorization , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[13]  Weiyao Lin,et al.  Picking Neural Activations for Fine-Grained Recognition , 2017, IEEE Transactions on Multimedia.

[14]  Bo Zhao,et al.  Diversified Visual Attention Networks for Fine-Grained Object Classification , 2016, IEEE Transactions on Multimedia.

[15]  Trevor Darrell,et al.  Part-Based R-CNNs for Fine-Grained Category Detection , 2014, ECCV.

[16]  N. Otsu A threshold selection method from gray level histograms , 1979 .

[17]  Andrew Zisserman,et al.  Spatial Transformer Networks , 2015, NIPS.

[18]  Andrew Zisserman,et al.  Very Deep Convolutional Networks for Large-Scale Image Recognition , 2014, ICLR.