Deep Spatial Pyramid: The Devil is Once Again in the Details

In this paper we show that by carefully making good choices for various detailed but important factors in a visual recognition framework using deep learning features, one can achieve a simple, efficient, yet highly accurate image classification system. We first list 5 important factors, based on both existing researches and ideas proposed in this paper. These important detailed factors include: 1) ‘2 matrix normalization is more effective than unnormalized or‘2 vector normalization, 2) the proposed natural deep spatial pyramid is very effective, and 3) a very small K in Fisher Vectors surprisingly achieves higher accuracy than normally used large K values. Along with other choices (convolutional activations and multiple scales), the proposed DSP framework is not only intuitive and efficient, but also achieves excellent classification accuracy on many benchmark datasets. For example, DSP’s accuracy on SUN397 is 59.78%, significantly higher than previous state-of-the-art (53.86%).

[1]  Pietro Perona,et al.  Learning Generative Visual Models from Few Training Examples: An Incremental Bayesian Approach Tested on 101 Object Categories , 2004, 2004 Conference on Computer Vision and Pattern Recognition Workshop.

[2]  Chih-Jen Lin,et al.  LIBLINEAR: A Library for Large Linear Classification , 2008, J. Mach. Learn. Res..

[3]  Svetlana Lazebnik,et al.  Multi-scale Orderless Pooling of Deep Convolutional Activation Features , 2014, ECCV.

[4]  Jian Sun,et al.  Spatial Pyramid Pooling in Deep Convolutional Networks for Visual Recognition , 2014, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[5]  Antonio Torralba,et al.  Recognizing indoor scenes , 2009, CVPR.

[6]  Luc Van Gool,et al.  The 2005 PASCAL Visual Object Classes Challenge , 2005, MLCW.

[7]  Geoffrey E. Hinton,et al.  ImageNet classification with deep convolutional neural networks , 2012, Commun. ACM.

[8]  Trevor Darrell,et al.  Rich Feature Hierarchies for Accurate Object Detection and Semantic Segmentation , 2013, 2014 IEEE Conference on Computer Vision and Pattern Recognition.

[9]  Cordelia Schmid,et al.  Beyond Bags of Features: Spatial Pyramid Matching for Recognizing Natural Scene Categories , 2006, 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'06).

[10]  Andrew Zisserman,et al.  Return of the Devil in the Details: Delving Deep into Convolutional Nets , 2014, BMVC.

[11]  Graham W. Taylor,et al.  Adaptive deconvolutional networks for mid and high level feature learning , 2011, 2011 International Conference on Computer Vision.

[12]  Bolei Zhou,et al.  Learning Deep Features for Scene Recognition using Places Database , 2014, NIPS.

[13]  Yi Yang,et al.  A discriminative CNN video representation for event detection , 2014, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[14]  Subhransu Maji,et al.  Deep convolutional filter banks for texture recognition and segmentation , 2014, ArXiv.

[15]  Jianxin Wu,et al.  Towards Good Practices for Action Video Encoding , 2014, 2014 IEEE Conference on Computer Vision and Pattern Recognition.

[16]  Thomas Mensink,et al.  Image Classification with the Fisher Vector: Theory and Practice , 2013, International Journal of Computer Vision.

[17]  ZissermanAndrew,et al.  The Pascal Visual Object Classes Challenge , 2015 .

[18]  Thomas Mensink,et al.  Improving the Fisher Kernel for Large-Scale Image Classification , 2010, ECCV.

[19]  Anton van den Hengel,et al.  The treasure beneath convolutional layers: Cross-convolutional-layer pooling for image classification , 2014, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[20]  Leonidas J. Guibas,et al.  Human action recognition by learning bases of action attributes and parts , 2011, 2011 International Conference on Computer Vision.

[21]  Andrew Zisserman,et al.  Very Deep Convolutional Networks for Large-Scale Image Recognition , 2014, ICLR.

[22]  G. Griffin,et al.  Caltech-256 Object Category Dataset , 2007 .

[23]  Cordelia Schmid,et al.  Aggregating local descriptors into a compact image representation , 2010, 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition.

[24]  James M. Rehg,et al.  CENTRIST: A Visual Descriptor for Scene Categorization , 2011, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[25]  Andrea Vedaldi,et al.  MatConvNet: Convolutional Neural Networks for MATLAB , 2014, ACM Multimedia.

[26]  Dumitru Erhan,et al.  Going deeper with convolutions , 2014, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[27]  Andrea Vedaldi,et al.  Vlfeat: an open and portable library of computer vision algorithms , 2010, ACM Multimedia.

[28]  Rob Fergus,et al.  Visualizing and Understanding Convolutional Networks , 2013, ECCV.

[29]  In-So Kweon,et al.  Fisher Kernel for Deep Neural Activations , 2014, ArXiv.

[30]  Krista A. Ehinger,et al.  SUN database: Large-scale scene recognition from abbey to zoo , 2010, 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition.

[31]  Andrew Zisserman,et al.  Video Google: a text retrieval approach to object matching in videos , 2003, Proceedings Ninth IEEE International Conference on Computer Vision.