How to use Bag-of-Words model better for image classification

The Bag-of-Words (BoW) framework is well-known in image classification. In the framework, there are two essential steps: 1) coding, which encodes local features by a visual vocabulary, and 2) pooling, which pools over the response of all features into image representation. Many coding and pooling methods are proposed, and how to apply them better in different conditions has become a practical problem. In this paper, to better use BoW in different applications, we study the relation between many typical coding methods and two popular pooling methods. Specifically, complete combinations of coding and pooling are evaluated based on an extremely large range of vocabulary sizes (16 to 1M) on five primary and popular datasets. Three typical ones are 15 Scenes, Caltech 101 and PASCAL VOC 2007, while the other two large-scale ones are Caltech 256 and ImageNet. Based on the systematic evaluation, some interesting conclusions are drawn. Some conclusions are the extensions of previous viewpoints, while some are different but important to understand BoW model. Based on these conclusions, we provide detailed application criterions by evaluating coding and pooling based on precision, efficiency and memory requirements in different applications. We hope that this study can be helpful to evaluate different coding and pooling methods, the conclusions can be beneficial to better understand BoW, and the application criterions can be valuable to use BoW better in different applications. We empirically study coding and pooling under a large range of vocabulary sizes.We provide detailed application guidelines to use BoW in practical applications.Combined with average pooling, most coding methods are comparable.The best performance of max pooling is better than the one of average pooling.A saturant point exists in maximum pooling.

[1]  Pietro Perona,et al.  Learning Generative Visual Models from Few Training Examples: An Incremental Bayesian Approach Tested on 101 Object Categories , 2004, 2004 Conference on Computer Vision and Pattern Recognition Workshop.

[2]  D. Hubel,et al.  Receptive fields, binocular interaction and functional architecture in the cat's visual cortex , 1962, The Journal of physiology.

[3]  Yihong Gong,et al.  Linear spatial pyramid matching using sparse coding for image classification , 2009, 2009 IEEE Conference on Computer Vision and Pattern Recognition.

[4]  G. Griffin,et al.  Caltech-256 Object Category Dataset , 2007 .

[5]  David G. Lowe,et al.  Fast Approximate Nearest Neighbors with Automatic Algorithm Configuration , 2009, VISAPP.

[6]  Nicolas Le Roux,et al.  Ask the locals: Multi-way local pooling for image recognition , 2011, 2011 International Conference on Computer Vision.

[7]  Gabriela Csurka,et al.  Visual categorization with bags of keypoints , 2002, eccv 2004.

[8]  Florent Perronnin,et al.  High-dimensional signature compression for large-scale image classification , 2011, CVPR 2011.

[9]  Nicolas Pinto,et al.  Why is Real-World Visual Object Recognition Hard? , 2008, PLoS Comput. Biol..

[10]  Xin Zhao,et al.  Feature coding via vector difference for image classification , 2012, 2012 19th IEEE International Conference on Image Processing.

[11]  Chong Wang,et al.  Exploring relations of visual codes for image classification , 2011, CVPR 2011.

[12]  Shin'ichi Satoh,et al.  Compact correlation coding for visual object categorization , 2011, 2011 International Conference on Computer Vision.

[13]  Florent Perronnin,et al.  Fisher Kernels on Visual Vocabularies for Image Categorization , 2007, 2007 IEEE Conference on Computer Vision and Pattern Recognition.

[14]  Lawrence D. Jackel,et al.  Handwritten Digit Recognition with a Back-Propagation Network , 1989, NIPS.

[15]  Jianguo Zhang,et al.  The PASCAL Visual Object Classes Challenge , 2006 .

[16]  Chih-Jen Lin,et al.  LIBSVM: A library for support vector machines , 2011, TIST.

[17]  Tong Zhang,et al.  Improved Local Coordinate Coding using Local Tangents , 2010, ICML.

[18]  Tieniu Tan,et al.  Salient coding for image classification , 2011, CVPR 2011.

[19]  P. Lennie Receptive fields , 2003, Current Biology.

[20]  Baoxin Li,et al.  Discriminative affine sparse codes for image classification , 2011, CVPR 2011.

[21]  Cordelia Schmid,et al.  Aggregating Local Image Descriptors into Compact Codes , 2012, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[22]  Wei Xu,et al.  Towards Optimal One Pass Large Scale Learning with Averaged Stochastic Gradient Descent , 2011, ArXiv.

[23]  Cor J. Veenman,et al.  Kernel Codebooks for Scene Categorization , 2008, ECCV.

[24]  Yihong Gong,et al.  Nonlinear Learning using Local Coordinate Coding , 2009, NIPS.

[25]  Andrea J. van Doorn,et al.  The Structure of Locally Orderless Images , 1999, International Journal of Computer Vision.

[26]  Andrea Vedaldi,et al.  Vlfeat: an open and portable library of computer vision algorithms , 2010, ACM Multimedia.

[27]  Thomas S. Huang,et al.  Image Classification Using Super-Vector Coding of Local Image Descriptors , 2010, ECCV.

[28]  Liang-Tien Chia,et al.  Local features are not lonely – Laplacian sparse coding for image classification , 2010, 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition.

[29]  Geoffrey E. Hinton,et al.  ImageNet classification with deep convolutional neural networks , 2012, Commun. ACM.

[30]  Luc Van Gool,et al.  The Pascal Visual Object Classes (VOC) Challenge , 2010, International Journal of Computer Vision.

[31]  Thomas S. Huang,et al.  Supervised translation-invariant sparse coding , 2010, 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition.

[32]  Ming Yang,et al.  Large-scale image classification: Fast feature extraction and SVM training , 2011, CVPR 2011.

[33]  Andrew Zisserman,et al.  The devil is in the details: an evaluation of recent feature encoding methods , 2011, BMVC.

[34]  Cordelia Schmid,et al.  Beyond Bags of Features: Spatial Pyramid Matching for Recognizing Natural Scene Categories , 2006, 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'06).

[35]  Thomas Mensink,et al.  Improving the Fisher Kernel for Large-Scale Image Classification , 2010, ECCV.

[36]  Hervé Le Borgne,et al.  Locality-constrained and spatially regularized coding for scene categorization , 2012, 2012 IEEE Conference on Computer Vision and Pattern Recognition.

[37]  Lei Wang,et al.  In defense of soft-assignment coding , 2011, 2011 International Conference on Computer Vision.

[38]  Thomas S. Huang,et al.  Efficient Highly Over-Complete Sparse Coding Using a Mixture Model , 2010, ECCV.

[39]  Jean Ponce,et al.  A Theoretical Analysis of Feature Pooling in Visual Recognition , 2010, ICML.

[40]  Tieniu Tan,et al.  Feature Coding in Image Classification: A Comprehensive Study , 2014, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[41]  Frank Nielsen,et al.  On weighting clustering , 2006, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[42]  T. Velmurugan,et al.  Computational Complexity between K-Means and K-Medoids Clustering Algorithms for Normal and Uniform Distributions of Data Points , 2010 .

[43]  Geoffrey E. Hinton,et al.  Reducing the Dimensionality of Data with Neural Networks , 2006, Science.

[44]  Radford M. Neal Pattern Recognition and Machine Learning , 2007, Technometrics.

[45]  Michael Isard,et al.  Object retrieval with large vocabularies and fast spatial matching , 2007, 2007 IEEE Conference on Computer Vision and Pattern Recognition.

[46]  Thomas Serre,et al.  Object recognition with features inspired by visual cortex , 2005, 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'05).

[47]  Cor J. Veenman,et al.  Visual Word Ambiguity , 2010, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[48]  Yihong Gong,et al.  Locality-constrained Linear Coding for image classification , 2010, 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition.

[49]  John D. Lafferty,et al.  Learning image representations from the pixel level via hierarchical sparse coding , 2011, CVPR 2011.

[50]  ZissermanAndrew,et al.  The Pascal Visual Object Classes Challenge , 2015 .

[51]  Hujun Bao,et al.  Sparse concept coding for visual analysis , 2011, CVPR 2011.

[52]  Chih-Jen Lin,et al.  LIBLINEAR: A Library for Large Linear Classification , 2008, J. Mach. Learn. Res..

[53]  Andrew Y. Ng,et al.  The Importance of Encoding Versus Training with Sparse Coding and Vector Quantization , 2011, ICML.

[54]  Yihong Gong,et al.  Linear spatial pyramid matching using sparse coding for image classification , 2009, CVPR.

[55]  Jean Ponce,et al.  Learning mid-level features for recognition , 2010, 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition.

[56]  David Picard,et al.  Improving image similarity with vectors of locally aggregated tensors , 2011, 2011 18th IEEE International Conference on Image Processing.