Weakly Supervised Complementary Parts Models for Fine-Grained Image Classification From the Bottom Up

Given a training dataset composed of images and corresponding category labels, deep convolutional neural networks show a strong ability in mining discriminative parts for image classification. However, deep convolutional neural networks trained with image level labels only tend to focus on the most discriminative parts while missing other object parts, which could provide complementary information. In this paper, we approach this problem from a different perspective. We build complementary parts models in a weakly supervised manner to retrieve information suppressed by dominant object parts detected by convolutional neural networks. Given image level labels only, we first extract rough object instances by performing weakly supervised object detection and instance segmentation using Mask R-CNN and CRF-based segmentation. Then we estimate and search for the best parts model for each object instance under the principle of preserving as much diversity as possible. In the last stage, we build a bi-directional long short-term memory (LSTM) network to fuze and encode the partial information of these complementary parts into a comprehensive feature for image classification. Experimental results indicate that the proposed method not only achieves significant improvement over our baseline models, but also outperforms state-of-the-art algorithms by a large margin (6.7%, 2.8%, 5.2% respectively) on Stanford Dogs 120, Caltech-UCSD Birds 2011-200 and Caltech 256.

[1]  Tao Mei,et al.  Learning Multi-attention Convolutional Neural Network for Fine-Grained Image Recognition , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[2]  Jonathan Krause,et al.  The Unreasonable Effectiveness of Noisy Data for Fine-Grained Recognition , 2015, ECCV.

[3]  Kaiming He,et al.  Feature Pyramid Networks for Object Detection , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[4]  Sergey Ioffe,et al.  Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift , 2015, ICML.

[5]  Michael S. Bernstein,et al.  ImageNet Large Scale Visual Recognition Challenge , 2014, International Journal of Computer Vision.

[6]  Andrew McCallum,et al.  An Introduction to Conditional Random Fields , 2010, Found. Trends Mach. Learn..

[7]  Xuhong Li,et al.  Explicit Inductive Bias for Transfer Learning with Convolutional Networks , 2018, ICML.

[8]  Geoffrey E. Hinton,et al.  ImageNet classification with deep convolutional neural networks , 2012, Commun. ACM.

[9]  Yang Song,et al.  Large Scale Fine-Grained Categorization and Domain-Specific Transfer Learning , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[10]  Kaiming He,et al.  Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks , 2015, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[11]  Matthieu Cord,et al.  WELDON: Weakly Supervised Learning of Deep Convolutional Neural Networks , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[12]  Trevor Darrell,et al.  Caffe: Convolutional Architecture for Fast Feature Embedding , 2014, ACM Multimedia.

[13]  Ming-Hsuan Yang,et al.  Learning Spatial-Aware Regressions for Visual Tracking , 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[14]  Qiang Wang,et al.  Robust Object Tracking Based on Temporal and Spatial Deep Networks , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[15]  Rob Fergus,et al.  Visualizing and Understanding Convolutional Networks , 2013, ECCV.

[16]  Xiao Liu,et al.  Fully Convolutional Attention Localization Networks: Efficient Attention Localization for Fine-Grained Recognition , 2016, ArXiv.

[17]  Bohyung Han,et al.  Learning to Select Pre-Trained Deep Representations with Bayesian Evidence Framework , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[18]  Matthieu Cord,et al.  WILDCAT: Weakly Supervised Learning of Deep ConvNets for Image Classification, Pointwise Localization and Segmentation , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[19]  Vladlen Koltun,et al.  Efficient Inference in Fully Connected CRFs with Gaussian Edge Potentials , 2011, NIPS.

[20]  Pietro Perona,et al.  Caltech-UCSD Birds 200 , 2010 .

[21]  Hao Su,et al.  Beyond Holistic Object Recognition: Enriching Image Understanding with Part States , 2016, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[22]  Carlo Tomasi,et al.  Features for Multi-target Multi-camera Tracking and Re-identification , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[23]  Marcel Simon,et al.  Neural Activation Constellations: Unsupervised Part Model Discovery with Convolutional Networks , 2015, 2015 IEEE International Conference on Computer Vision (ICCV).

[24]  Alexander Binder,et al.  On Pixel-Wise Explanations for Non-Linear Classifier Decisions by Layer-Wise Relevance Propagation , 2015, PloS one.

[25]  Luc Van Gool,et al.  The Pascal Visual Object Classes (VOC) Challenge , 2010, International Journal of Computer Vision.

[26]  Liang Lin,et al.  Multi-label Image Recognition by Recurrently Discovering Attentional Regions , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[27]  Trevor Darrell,et al.  Rich Feature Hierarchies for Accurate Object Detection and Semantic Segmentation , 2013, 2014 IEEE Conference on Computer Vision and Pattern Recognition.

[28]  Errui Ding,et al.  Multi-Attention Multi-Class Constraint for Fine-grained Image Recognition , 2018, ECCV.

[29]  Larry S. Davis,et al.  Learning a Discriminative Filter Bank Within a CNN for Fine-Grained Recognition , 2016, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[30]  Dumitru Erhan,et al.  Going deeper with convolutions , 2014, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[31]  Bolei Zhou,et al.  Learning Deep Features for Discriminative Localization , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[32]  Yang Song,et al.  The iNaturalist Species Classification and Detection Dataset , 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[33]  Yizhou Yu,et al.  Borrowing Treasures from the Wealthy: Deep Transfer Learning through Selective Joint Fine-Tuning , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[34]  Trevor Darrell,et al.  Part-Based R-CNNs for Fine-Grained Category Detection , 2014, ECCV.

[35]  Luc Van Gool,et al.  Weakly Supervised Cascaded Convolutional Networks , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[36]  Andrew Zisserman,et al.  Very Deep Convolutional Networks for Large-Scale Image Recognition , 2014, ICLR.

[37]  G. Griffin,et al.  Caltech-256 Object Category Dataset , 2007 .

[38]  Michael Lam,et al.  Fine-Grained Recognition as HSnet Search for Informative Image Parts , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[39]  Forrest N. Iandola,et al.  Deformable Part Descriptors for Fine-Grained Recognition and Attribute Prediction , 2013, 2013 IEEE International Conference on Computer Vision.

[40]  Jian Sun,et al.  Deep Residual Learning for Image Recognition , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[41]  Jürgen Schmidhuber,et al.  Long Short-Term Memory , 1997, Neural Computation.

[42]  David A. McAllester,et al.  A discriminatively trained, multiscale, deformable part model , 2008, 2008 IEEE Conference on Computer Vision and Pattern Recognition.

[43]  Huimin Ma,et al.  Weakly-Supervised Semantic Segmentation by Iteratively Mining Common Object Features , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[44]  Thomas Brox,et al.  Universität Des Saarlandes Fachrichtung 6.1 – Mathematik Level Set Segmentation with Multiple Regions Level Set Segmentation with Multiple Regions , 2022 .

[45]  Xinge You,et al.  Hierarchical Bilinear Pooling for Fine-Grained Visual Recognition , 2018, ECCV.

[46]  Yizhou Yu,et al.  Multi-evidence Filtering and Fusion for Multi-label Classification, Object Detection and Semantic Segmentation Based on Weakly Supervised Learning , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[47]  Fei-Fei Li,et al.  Novel Dataset for Fine-Grained Image Categorization : Stanford Dogs , 2012 .

[48]  Tao Mei,et al.  Look Closer to See Better: Recurrent Attention Convolutional Neural Network for Fine-Grained Image Recognition , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[49]  Kaiming He,et al.  Mask R-CNN , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[50]  Bolei Zhou,et al.  Network Dissection: Quantifying Interpretability of Deep Visual Representations , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[51]  Sergey Ioffe,et al.  Rethinking the Inception Architecture for Computer Vision , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[52]  Julien Mairal,et al.  BlitzNet: A Real-Time Deep Network for Scene Understanding , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).