Where to Focus: Deep Attention-based Spatially Recurrent Bilinear Networks for Fine-Grained Visual Recognition

Fine-grained visual recognition typically depends on modeling subtle difference from object parts. However, these parts often exhibit dramatic visual variations such as occlusions, viewpoints, and spatial transformations, making it hard to detect. In this paper, we present a novel attention-based model to automatically, selectively and accurately focus on critical object regions with higher importance against appearance variations. Given an image, two different Convolutional Neural Networks (CNNs) are constructed, where the outputs of two CNNs are correlated through bilinear pooling to simultaneously focus on discriminative regions and extract relevant features. To capture spatial distributions among the local regions with visual attention, soft attention based spatial Long-Short Term Memory units (LSTMs) are incorporated to realize spatially recurrent yet visually selective over local input patterns. All the above intuitions equip our network with the following novel model: two-stream CNN layers, bilinear pooling layer, spatial recurrent layer with location attention are jointly trained via an end-to-end fashion to serve as the part detector and feature extractor, whereby relevant features are localized and extracted attentively. We show the significance of our network against two well-known visual recognition tasks: fine-grained image classification and person re-identification.

[1]  Lin Wu,et al.  Clustering via geometric median shift over Riemannian manifolds , 2013, Inf. Sci..

[2]  Lin Wu,et al.  Iterative Views Agreement: An Iterative Low-Rank Based Structured Optimization Method to Multi-View Spectral Clustering , 2016, IJCAI.

[3]  E.E. Pissaloux,et al.  Image Processing , 1994, Proceedings. Second Euromicro Workshop on Parallel and Distributed Processing.

[4]  Naila Murray,et al.  Revisiting the Fisher vector for fine-grained classification , 2014, Pattern Recognit. Lett..

[5]  Victor S. Lempitsky,et al.  Multi-Region bilinear convolutional neural networks for person re-identification , 2015, 2017 14th IEEE International Conference on Advanced Video and Signal Based Surveillance (AVSS).

[6]  Subhransu Maji,et al.  Fine-Grained Visual Classification of Aircraft , 2013, ArXiv.

[7]  Xiaogang Wang,et al.  Person Re-identification by Salience Matching , 2013, 2013 IEEE International Conference on Computer Vision.

[8]  E. Silerova,et al.  Knowledge and information systems , 2018 .

[9]  Lin Wu,et al.  Robust Hashing for Multi-View Data: Jointly Learning Low-Rank Kernelized Similarity Consensus and Hash Functions , 2016, Image Vis. Comput..

[10]  Lin Wu,et al.  Exploiting Attribute Correlations: A Novel Trace Lasso-Based Weakly Supervised Dictionary Learning Method , 2017, IEEE Transactions on Cybernetics.

[11]  Lin Wu,et al.  Robust Subspace Clustering for Multi-View Data by Exploiting Correlation Consensus , 2015, IEEE Transactions on Image Processing.

[12]  Lin Wu,et al.  Multiview Spectral Clustering via Structured Low-Rank Matrix Factorization , 2017, IEEE Transactions on Neural Networks and Learning Systems.

[13]  Ivan Laptev,et al.  Object Detection Using Strongly-Supervised Deformable Part Models , 2012, ECCV.

[14]  Marek R. Ogiela,et al.  Multimedia tools and applications , 2005, Multimedia Tools and Applications.

[15]  Pietro Perona,et al.  The Caltech-UCSD Birds-200-2011 Dataset , 2011 .

[16]  Jianfei Cai,et al.  Weakly Supervised Fine-Grained Categorization With Part-Based Image Representation , 2016, IEEE Transactions on Image Processing.

[17]  Lin Wu,et al.  PersonNet: Person Re-identification with Deep Convolutional Neural Networks , 2016, ArXiv.

[18]  Jiwen Lu,et al.  Nonlinear Local Metric Learning for Person Re-identification , 2015, ArXiv.

[19]  Yang Wang,et al.  Structured Deep Hashing with Convolutional Neural Networks for Fast Person Re-identification , 2017, Comput. Vis. Image Underst..

[20]  Yoshua Bengio,et al.  Neural Machine Translation by Jointly Learning to Align and Translate , 2014, ICLR.

[21]  Lin Wu,et al.  Deep Linear Discriminant Analysis on Fisher Networks: A Hybrid Architecture for Person Re-identification , 2016, Pattern Recognit..

[22]  Andrew Zisserman,et al.  Symbiotic Segmentation and Part Localization for Fine-Grained Categorization , 2013, 2013 IEEE International Conference on Computer Vision.

[23]  Jürgen Schmidhuber,et al.  Multidimensional Recurrent Neural Networks , 2007 .

[24]  Lin Wu,et al.  Max-sum diversification on image ranking with non-uniform matroid constraints , 2013, Neurocomputing.

[25]  Hai Tao,et al.  Evaluating Appearance Models for Recognition, Reacquisition, and Tracking , 2007 .

[26]  Lin Wu,et al.  Deep adaptive feature embedding with local sample distributions for person re-identification , 2017, Pattern Recognit..

[27]  Lin Wu,et al.  Shifting Hypergraphs by Probabilistic Voting , 2014, PAKDD.

[28]  Andrew Zisserman,et al.  Return of the Devil in the Details: Delving Deep into Convolutional Nets , 2014, BMVC.

[29]  Ya Zhang,et al.  Part-Stacked CNN for Fine-Grained Visual Categorization , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[30]  T. Munich,et al.  Offline Handwriting Recognition with Multidimensional Recurrent Neural Networks , 2008, NIPS.

[31]  Lin Wu,et al.  Shifting multi-hypergraphs via collaborative probabilistic voting , 2015, Knowledge and Information Systems.

[32]  Lin Wu,et al.  Effective Multi-Query Expansions: Robust Landmark Retrieval , 2015, ACM Multimedia.

[33]  Nanning Zheng,et al.  Similarity Learning with Spatial Constraints for Person Re-identification , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[34]  Jonathan Krause,et al.  3D Object Representations for Fine-Grained Categorization , 2013, 2013 IEEE International Conference on Computer Vision Workshops.

[35]  Andrew Zisserman,et al.  Very Deep Convolutional Networks for Large-Scale Image Recognition , 2014, ICLR.

[36]  Lin Wu,et al.  Unsupervised Metric Fusion Over Multiview Data by Graph Random Walk-Based Cross-View Diffusion , 2017, IEEE Transactions on Neural Networks and Learning Systems.

[37]  Lin Wu,et al.  What-and-Where to Match: Deep Spatially Multiplicative Integration Networks for Person Re-identification , 2017, Pattern Recognit..

[38]  Subhransu Maji,et al.  Bilinear CNN Models for Fine-Grained Visual Recognition , 2015, 2015 IEEE International Conference on Computer Vision (ICCV).

[39]  B. C. Brookes,et al.  Information Sciences , 2020, Cognitive Skills You Need for the 21st Century.

[40]  Lin Wu,et al.  Efficient image and tag co-ranking: a bregman divergence optimization method , 2013, ACM Multimedia.

[41]  Jianxin Wu,et al.  Person Re-Identification with Correspondence Structure Learning , 2015, 2015 IEEE International Conference on Computer Vision (ICCV).

[42]  Alessandro Perina,et al.  Person re-identification by symmetry-driven accumulation of local features , 2010, 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition.

[43]  Lin Wu,et al.  Exploiting Correlation Consensus: Towards Subspace Clustering for Multi-modal Data , 2014, ACM Multimedia.

[44]  Lin Wu,et al.  Effective Multi-Query Expansions: Collaborative Deep Networks for Robust Landmark Retrieval , 2017, IEEE Transactions on Image Processing.

[45]  Xiaogang Wang,et al.  DeepReID: Deep Filter Pairing Neural Network for Person Re-identification , 2014, 2014 IEEE Conference on Computer Vision and Pattern Recognition.

[46]  Yoshua Bengio,et al.  Show, Attend and Tell: Neural Image Caption Generation with Visual Attention , 2015, ICML.

[47]  Jian Sun,et al.  Spatial Pyramid Pooling in Deep Convolutional Networks for Visual Recognition , 2014, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[48]  Jonathan Krause,et al.  Fine-grained recognition without part annotations , 2015, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[49]  Michael Jones,et al.  An improved deep learning architecture for person re-identification , 2015, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[50]  P. Cavanagh Visual cognition , 2011, Vision Research.

[51]  Andrew Zisserman,et al.  Spatial Transformer Networks , 2015, NIPS.

[52]  Lin Wu,et al.  Beyond Low-Rank Representations: Orthogonal Clustering Basis Reconstruction with Optimized Graph Structure for Multi-view Spectral Clustering , 2017, Neural Networks.

[53]  Ruslan Salakhutdinov,et al.  Action Recognition using Visual Attention , 2015, NIPS 2015.

[54]  Lin Wu,et al.  An efficient framework of Bregman divergence optimization for co-ranking images and tags in a heterogeneous network , 2015, Multimedia Tools and Applications.

[55]  Pietro Perona,et al.  Bird Species Categorization Using Pose Normalized Deep Convolutional Nets , 2014, ArXiv.

[56]  Victor Lempitsky,et al.  Multiregion Bilinear Convolutional Neural Networks for Person Re-Identification , 2015 .

[57]  Lin Wu,et al.  LBMCH: Learning Bridging Mapping for Cross-modal Hashing , 2015, SIGIR.

[58]  Trevor Darrell,et al.  Part-Based R-CNNs for Fine-Grained Category Detection , 2014, ECCV.

[59]  Geoffrey E. Hinton,et al.  ImageNet classification with deep convolutional neural networks , 2012, Commun. ACM.