Multi-scale Orderless Pooling of Deep Convolutional Activation Features

Deep convolutional neural networks (CNN) have shown their promise as a universal representation for recognition. However, global CNN activations lack geometric invariance, which limits their robustness for classification and matching of highly variable scenes. To improve the invariance of CNN activations without degrading their discriminative power, this paper presents a simple but effective scheme called multi-scale orderless pooling (MOP-CNN). This scheme extracts CNN activations for local patches at multiple scale levels, performs orderless VLAD pooling of these activations at each level separately, and concatenates the result. The resulting MOP-CNN representation can be used as a generic feature for either supervised or unsupervised recognition tasks, from image classification to instance-level retrieval; it consistently outperforms global CNN activations without requiring any joint training of prediction layers for a particular target dataset. In absolute terms, it achieves state-of-the-art results on the challenging SUN397 and MIT Indoor Scenes classification datasets, and competitive results on ILSVRC2012/2013 classification and INRIA Holidays retrieval datasets.

[1]  Lawrence D. Jackel,et al.  Handwritten Digit Recognition with a Back-Propagation Network , 1989, NIPS.

[2]  IEEE conference on computer vision and pattern recognition , 2000, Proceedings IEEE Conference on Computer Vision and Pattern Recognition. CVPR 2000 (Cat. No.PR00662).

[3]  Andrew Zisserman,et al.  Video Google: a text retrieval approach to object matching in videos , 2003, Proceedings Ninth IEEE International Conference on Computer Vision.

[4]  G LoweDavid,et al.  Distinctive Image Features from Scale-Invariant Keypoints , 2004 .

[5]  Gabriela Csurka,et al.  Visual categorization with bags of keypoints , 2002, eccv 2004.

[6]  Trevor Darrell,et al.  The pyramid match kernel: discriminative classification with sets of image features , 2005, Tenth IEEE International Conference on Computer Vision (ICCV'05) Volume 1.

[7]  Cordelia Schmid,et al.  Beyond Bags of Features: Spatial Pyramid Matching for Recognizing Natural Scene Categories , 2006, 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'06).

[8]  Florent Perronnin,et al.  Fisher Kernels on Visual Vocabularies for Image Categorization , 2007, 2007 IEEE Conference on Computer Vision and Pattern Recognition.

[9]  Cordelia Schmid,et al.  Hamming Embedding and Weak Geometric Consistency for Large Scale Image Search , 2008, ECCV.

[10]  Li Fei-Fei,et al.  ImageNet: A large-scale hierarchical image database , 2009, CVPR.

[11]  Honglak Lee,et al.  Convolutional deep belief networks for scalable unsupervised learning of hierarchical representations , 2009, ICML '09.

[12]  Antonio Torralba,et al.  Recognizing indoor scenes , 2009, 2009 IEEE Conference on Computer Vision and Pattern Recognition.

[13]  Yihong Gong,et al.  Locality-constrained Linear Coding for image classification , 2010, 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition.

[14]  Thomas Mensink,et al.  Improving the Fisher Kernel for Large-Scale Image Classification , 2010, ECCV.

[15]  Cordelia Schmid,et al.  Aggregating local descriptors into a compact image representation , 2010, 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition.

[16]  Florent Perronnin,et al.  Large-scale image retrieval with compressed Fisher vectors , 2010, 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition.

[17]  Krista A. Ehinger,et al.  SUN database: Large-scale scene recognition from abbey to zoo , 2010, 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition.

[18]  Svetlana Lazebnik,et al.  Scene recognition and weakly supervised object localization with deformable part-based models , 2011, 2011 International Conference on Computer Vision.

[19]  Nitish Srivastava,et al.  Improving neural networks by preventing co-adaptation of feature detectors , 2012, ArXiv.

[20]  Ernest Valveny,et al.  Leveraging category-level labels for instance-level image retrieval , 2012, 2012 IEEE Conference on Computer Vision and Pattern Recognition.

[21]  Cordelia Schmid,et al.  Towards good practice in large-scale learning for image classification , 2012, CVPR.

[22]  Geoffrey E. Hinton,et al.  ImageNet classification with deep convolutional neural networks , 2012, Commun. ACM.

[23]  Hervé Jégou,et al.  Negative Evidences and Co-occurences in Image Retrieval: The Benefit of PCA and Whitening , 2012, ECCV.

[24]  Alexei A. Efros,et al.  Unsupervised Discovery of Mid-Level Discriminative Patches , 2012, ECCV.

[25]  Cordelia Schmid,et al.  Good Practice in Large-Scale Learning for Image Classification , 2014, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[26]  Lorenzo Torresani,et al.  Leveraging Structure from Motion to Learn Discriminative Codebooks for Scalable Landmark Classification , 2013, 2013 IEEE Conference on Computer Vision and Pattern Recognition.

[27]  Yann LeCun,et al.  Regularization of Neural Networks using DropConnect , 2013, ICML.

[28]  Fei-Fei Li,et al.  Detecting Avocados to Zucchinis: What Have We Done, and Where Are We Going? , 2013, 2013 IEEE International Conference on Computer Vision.

[29]  C. V. Jawahar,et al.  Blocks That Shout: Distinctive Parts for Scene Classification , 2013, 2013 IEEE Conference on Computer Vision and Pattern Recognition.

[30]  Andrew Zisserman,et al.  Deep Fisher Networks for Large-Scale Image Classification , 2013, NIPS.

[31]  Marc'Aurelio Ranzato,et al.  Building high-level features using large scale unsupervised learning , 2011, 2013 IEEE International Conference on Acoustics, Speech and Signal Processing.

[32]  Yannis Avrithis,et al.  To Aggregate or Not to aggregate: Selective Match Kernels for Image Search , 2013, 2013 IEEE International Conference on Computer Vision.

[33]  Alexei A. Efros,et al.  Mid-level Visual Element Discovery as Discriminative Mode Seeking , 2013, NIPS.

[34]  Yoshua Bengio,et al.  Maxout Networks , 2013, ICML.

[35]  Thomas Mensink,et al.  Image Classification with the Fisher Vector: Theory and Practice , 2013, International Journal of Computer Vision.

[36]  Rob Fergus,et al.  Visualizing and Understanding Convolutional Neural Networks , 2013 .

[37]  Rob Fergus,et al.  Visualizing and Understanding Convolutional Networks , 2013, ECCV.

[38]  Trevor Darrell,et al.  Rich Feature Hierarchies for Accurate Object Detection and Semantic Segmentation , 2013, 2014 IEEE Conference on Computer Vision and Pattern Recognition.

[39]  R. Fergus,et al.  OverFeat: Integrated Recognition, Localization and Detection using Convolutional Networks , 2013, ICLR.

[40]  Forrest N. Iandola,et al.  DenseNet: Implementing Efficient ConvNet Descriptor Pyramids , 2014, ArXiv.

[41]  Stefan Carlsson,et al.  CNN Features Off-the-Shelf: An Astounding Baseline for Recognition , 2014, 2014 IEEE Conference on Computer Vision and Pattern Recognition Workshops.

[42]  Trevor Darrell,et al.  Caffe: Convolutional Architecture for Fast Feature Embedding , 2014, ACM Multimedia.

[43]  Trevor Darrell,et al.  DeCAF: A Deep Convolutional Activation Feature for Generic Visual Recognition , 2013, ICML.

[44]  Ivan Laptev,et al.  Learning and Transferring Mid-level Image Representations Using Convolutional Neural Networks , 2014, 2014 IEEE Conference on Computer Vision and Pattern Recognition.

[45]  Michael S. Bernstein,et al.  ImageNet Large Scale Visual Recognition Challenge , 2014, International Journal of Computer Vision.