Which and How Many Regions to Gaze: Focus Discriminative Regions for Fine-Grained Visual Categorization

Fine-grained visual categorization (FGVC) aims to discriminate similar subcategories that belong to the same superclass. Because the distinctions among similar subcategories are subtle and local, telling them apart is highly challenging even for humans. Localizing these distinctions is therefore essential for fine-grained visual categorization, and it raises two pivotal problems: (1) Which regions are discriminative and representative enough to distinguish a subcategory from the others? (2) How many discriminative regions are necessary to achieve the best categorization performance? Addressing these two problems adaptively and intelligently remains difficult. Existing mainstream methods widely rely on human-specified priors and experimental validation to decide which and how many regions to gaze at, which severely restricts their usability and scalability. To address the two problems, this paper proposes a multi-scale and multi-granularity deep reinforcement learning approach (M2DRL), which learns multi-granularity discriminative region attention and multi-scale region-based feature representation. Its main contributions are as follows: (1) Multi-granularity discriminative localization is proposed to localize the distinctions via a two-stage deep reinforcement learning approach, which discovers discriminative regions at multiple granularities in a hierarchical manner (the "which" problem) and determines the number of discriminative regions automatically and adaptively (the "how many" problem). (2) Multi-scale representation learning helps to localize regions at different scales as well as encode images at different scales, boosting fine-grained visual categorization performance. (3) A semantic reward function is proposed to drive M2DRL to fully capture the salient and conceptual visual information by jointly considering attention and category information. It allows the deep reinforcement learning to localize the distinctions in a weakly supervised or even an unsupervised manner. (4) Unsupervised discriminative localization is further explored to avoid the heavy labor cost of annotation, which greatly strengthens the usability and scalability of the M2DRL approach. Compared with state-of-the-art methods on two widely used fine-grained visual categorization datasets, our M2DRL approach achieves the best categorization accuracy.
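
To make the idea of a reward that jointly considers attention and category information more concrete, the following is a minimal illustrative sketch, not the reward actually used by M2DRL. It assumes a 2D saliency (attention) map, axis-aligned candidate boxes, and a classifier confidence for the current region. The names `attention_score`, `semantic_reward`, and the blending weight `alpha` are hypothetical and introduced only for illustration.

```python
from typing import Sequence, Tuple

# Hypothetical box format: (x0, y0, x1, y1), half-open pixel coordinates.
Box = Tuple[int, int, int, int]


def attention_score(saliency: Sequence[Sequence[float]], box: Box) -> float:
    """Mean saliency inside the box (assumed attention term, not the paper's)."""
    x0, y0, x1, y1 = box
    area = (x1 - x0) * (y1 - y0)
    if area <= 0:
        return 0.0
    inside = sum(sum(row[x0:x1]) for row in saliency[y0:y1])
    return inside / area


def semantic_reward(
    saliency: Sequence[Sequence[float]],
    prev_box: Box,
    new_box: Box,
    prev_conf: float,
    new_conf: float,
    alpha: float = 0.5,  # assumed weight between attention and category terms
) -> float:
    """Return +1 if the agent's move improved the blended score, else -1."""
    prev = alpha * attention_score(saliency, prev_box) + (1 - alpha) * prev_conf
    new = alpha * attention_score(saliency, new_box) + (1 - alpha) * new_conf
    return 1.0 if new > prev else -1.0


# Toy example: a 4x4 saliency map whose mass sits in the central 2x2 block.
saliency_map = [
    [0, 0, 0, 0],
    [0, 1, 1, 0],
    [0, 1, 1, 0],
    [0, 0, 0, 0],
]
# Shrinking from the full image to the salient centre, with a higher
# classifier confidence, counts as an improvement under this sketch.
reward = semantic_reward(saliency_map, (0, 0, 4, 4), (1, 1, 3, 3), 0.6, 0.7)
```

A sign-based reward of this kind is a common choice in reinforcement-learning localization because it only asks whether an action helped. In a setting without category labels, the confidence term would have to come from some other signal, which this sketch does not model.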
