Bilinear Convolutional Neural Networks for Fine-Grained Visual Recognition

We present a simple and effective architecture for fine-grained recognition called <italic>Bilinear Convolutional Neural Networks (B-CNNs)</italic>. These networks represent an image as a <italic>pooled outer product</italic> of features derived from two CNNs and capture localized feature interactions in a translationally invariant manner. B-CNNs are related to orderless texture representations built on deep features but can be trained in an end-to-end manner. Our most accurate model obtains 84.1, 79.4, 84.5 and 91.3 percent per-image accuracy on the Caltech-UCSD birds <xref ref-type="bibr" rid="ref1">[1]</xref> , NABirds <xref ref-type="bibr" rid="ref2">[2]</xref> , FGVC aircraft <xref ref-type="bibr" rid="ref3">[3]</xref> , and Stanford cars <xref ref-type="bibr" rid="ref4">[4]</xref> dataset respectively and runs at 30 frames-per-second on a NVIDIA Titan X GPU. We then present a systematic analysis of these networks and show that (1) the bilinear features are highly redundant and can be reduced by an order of magnitude in size without significant loss in accuracy, (2) are also effective for other image classification tasks such as texture and scene recognition, and (3) can be trained from scratch on the ImageNet dataset offering consistent improvements over the baseline architecture. Finally, we present visualizations of these models on various datasets using top activations of neural units and gradient-based inversion techniques. The source code for the complete system is available at <uri>http://vis-www.cs.umass.edu/bcnn</uri>.

[1]  A. Atiya,et al.  Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond , 2005, IEEE Transactions on Neural Networks.

[2]  Pedro M. Domingos,et al.  Discriminative Learning of Sum-Product Networks , 2012, NIPS.

[3]  Jonathan Krause,et al.  Fine-grained recognition without part annotations , 2015, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[4]  Trevor Darrell,et al.  Rich Feature Hierarchies for Accurate Object Detection and Semantic Segmentation , 2013, 2014 IEEE Conference on Computer Vision and Pattern Recognition.

[5]  Pietro Perona,et al.  Building a bird recognition app and large scale dataset with citizen scientists: The fine print in fine-grained dataset collection , 2015, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[6]  Pietro Perona,et al.  Bird Species Categorization Using Pose Normalized Deep Convolutional Nets , 2014, ArXiv.

[7]  Leon A. Gatys,et al.  Texture Synthesis Using Shallow Convolutional Networks with Random Filters , 2016, ArXiv.

[8]  Andrea Vedaldi,et al.  Visualizing Deep Convolutional Neural Networks Using Natural Pre-images , 2015, International Journal of Computer Vision.

[9]  Jonathan Krause,et al.  The Unreasonable Effectiveness of Noisy Data for Fine-Grained Recognition , 2015, ECCV.

[10]  Alex Graves,et al.  Recurrent Models of Visual Attention , 2014, NIPS.

[11]  Trevor Darrell,et al.  Multimodal Compact Bilinear Pooling for Visual Question Answering and Visual Grounding , 2016, EMNLP.

[12]  Jian Sun,et al.  Deep Residual Learning for Image Recognition , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[13]  Cordelia Schmid,et al.  Aggregating local descriptors into a compact image representation , 2010, 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition.

[14]  Trevor Darrell,et al.  Pose pooling kernels for sub-category recognition , 2012, 2012 IEEE Conference on Computer Vision and Pattern Recognition.

[15]  Sergey Ioffe,et al.  Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift , 2015, ICML.

[16]  Leon A. Gatys,et al.  Image Style Transfer Using Convolutional Neural Networks , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[17]  Andrea Vedaldi,et al.  MatConvNet: Convolutional Neural Networks for MATLAB , 2014, ACM Multimedia.

[18]  Kaiming He,et al.  Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks , 2015, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[19]  Larry S. Davis,et al.  Birdlets: Subordinate categorization using volumetric primitives and pose-normalized appearance , 2011, 2011 International Conference on Computer Vision.

[20]  Svetlana Lazebnik,et al.  Multi-scale Orderless Pooling of Deep Convolutional Activation Features , 2014, ECCV.

[21]  Dumitru Erhan,et al.  Going deeper with convolutions , 2014, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[22]  Jitendra Malik,et al.  Learning to segment moving objects in videos , 2014, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[23]  Barbara Caputo,et al.  Class-Specific Material Categorisation , 2005, ICCV.

[24]  Leon A. Gatys,et al.  Texture synthesis and the controlled generation of natural stimuli using convolutional neural networks , 2015, ArXiv.

[25]  Iasonas Kokkinos,et al.  Describing Textures in the Wild , 2013, 2014 IEEE Conference on Computer Vision and Pattern Recognition.

[26]  Jitendra Malik,et al.  Representing and Recognizing the Visual Appearance of Materials using Three-dimensional Textons , 2001, International Journal of Computer Vision.

[27]  Antonio Torralba,et al.  Recognizing indoor scenes , 2009, 2009 IEEE Conference on Computer Vision and Pattern Recognition.

[28]  Yang Gao,et al.  Fine-grained pose prediction, normalization, and recognition , 2015, ArXiv.

[29]  Chunhua Shen,et al.  Cross-Convolutional-Layer Pooling for Image Recognition , 2015, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[30]  Subhransu Maji,et al.  Fine-Grained Visual Classification of Aircraft , 2013, ArXiv.

[31]  Rasmus Pagh,et al.  Fast and scalable polynomial kernels via explicit feature maps , 2013, KDD.

[32]  Wei Liu,et al.  SSD: Single Shot MultiBox Detector , 2015, ECCV.

[33]  Geoffrey E. Hinton,et al.  ImageNet classification with deep convolutional neural networks , 2012, Commun. ACM.

[34]  Yang Gao,et al.  Compact Bilinear Pooling , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[35]  Subhransu Maji,et al.  Describing people: A poselet-based approach to attribute classification , 2011, 2011 International Conference on Computer Vision.

[36]  Pietro Perona,et al.  The Caltech-UCSD Birds-200-2011 Dataset , 2011 .

[37]  Naila Murray,et al.  Revisiting the Fisher vector for fine-grained classification , 2014, Pattern Recognit. Lett..

[38]  Jonghyun Choi,et al.  Mining Discriminative Triplets of Patches for Fine-Grained Classification , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[39]  Edward H. Adelson,et al.  Material perception: What can you see in a brief glance? , 2010 .

[40]  Andrew Zisserman,et al.  Two-Stream Convolutional Networks for Action Recognition in Videos , 2014, NIPS.

[41]  Subhransu Maji,et al.  Visualizing and Understanding Deep Texture Representations , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[42]  Andrew Zisserman,et al.  Convolutional Two-Stream Network Fusion for Video Action Recognition , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[43]  Michael S. Bernstein,et al.  ImageNet Large Scale Visual Recognition Challenge , 2014, International Journal of Computer Vision.

[44]  Florent Perronnin,et al.  Fisher Kernels on Visual Vocabularies for Image Categorization , 2007, 2007 IEEE Conference on Computer Vision and Pattern Recognition.

[45]  Thomas Mensink,et al.  Improving the Fisher Kernel for Large-Scale Image Classification , 2010, ECCV.

[46]  Eero P. Simoncelli,et al.  A Parametric Texture Model Based on Joint Statistics of Complex Wavelet Coefficients , 2000, International Journal of Computer Vision.

[47]  Andrew Zisserman,et al.  Return of the Devil in the Details: Delving Deep into Convolutional Nets , 2014, BMVC.

[48]  Jonathan Krause,et al.  3D Object Representations for Fine-Grained Categorization , 2013, 2013 IEEE International Conference on Computer Vision Workshops.

[49]  Marcel Simon,et al.  Neural Activation Constellations: Unsupervised Part Model Discovery with Convolutional Networks , 2015, 2015 IEEE International Conference on Computer Vision (ICCV).

[50]  Subhransu Maji,et al.  Bilinear CNN Models for Fine-Grained Visual Recognition , 2015, 2015 IEEE International Conference on Computer Vision (ICCV).

[51]  Stefan Carlsson,et al.  CNN Features Off-the-Shelf: An Astounding Baseline for Recognition , 2014, 2014 IEEE Conference on Computer Vision and Pattern Recognition Workshops.

[52]  David G. Lowe,et al.  Object recognition from local scale-invariant features , 1999, Proceedings of the Seventh IEEE International Conference on Computer Vision.

[53]  Trevor Darrell,et al.  DeCAF: A Deep Convolutional Activation Feature for Generic Visual Recognition , 2013, ICML.

[54]  Tomás Pajdla,et al.  NetVLAD: CNN Architecture for Weakly Supervised Place Recognition , 2018, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[55]  Charless C. Fowlkes,et al.  Bilinear classifiers for visual recognition , 2009, NIPS.

[56]  Koray Kavukcuoglu,et al.  Visual Attention , 2020, Computational Models for Cognitive Vision.

[57]  Ahmed M. Elgammal,et al.  SPDA-CNN: Unifying Semantic Part Detection and Abstraction for Fine-Grained Recognition , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[58]  Andrea Vedaldi,et al.  Vlfeat: an open and portable library of computer vision algorithms , 2010, ACM Multimedia.

[59]  Joshua B. Tenenbaum,et al.  Separating Style and Content with Bilinear Models , 2000, Neural Computation.

[60]  Andrew Zisserman,et al.  Spatial Transformer Networks , 2015, NIPS.

[61]  Qi Tian,et al.  Picking Deep Filter Responses for Fine-Grained Image Recognition , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[62]  Luc Van Gool,et al.  The Pascal Visual Object Classes (VOC) Challenge , 2010, International Journal of Computer Vision.

[63]  Cristian Sminchisescu,et al.  Matrix Backpropagation for Deep Networks with Structured Layers , 2015, 2015 IEEE International Conference on Computer Vision (ICCV).

[64]  Trevor Darrell,et al.  Part-Based R-CNNs for Fine-Grained Category Detection , 2014, ECCV.

[65]  Cristian Sminchisescu,et al.  Semantic Segmentation with Second-Order Pooling , 2012, ECCV.

[66]  Andrew Zisserman,et al.  Very Deep Convolutional Networks for Large-Scale Image Recognition , 2014, ICLR.

[67]  Jian Yang,et al.  Boosted Convolutional Neural Networks , 2016, BMVC.

[68]  Gabriela Csurka,et al.  Visual categorization with bags of keypoints , 2002, eccv 2004.

[69]  Andrew Zisserman,et al.  A Compact and Discriminative Face Track Descriptor , 2014, 2014 IEEE Conference on Computer Vision and Pattern Recognition.

[70]  Iasonas Kokkinos,et al.  Deep Filter Banks for Texture Recognition, Description, and Segmentation , 2015, International Journal of Computer Vision.