A Study on the Cardinality of Ordered Average Pooling in Visual Recognition

Bag-of-Words methods can be robust to image scaling, translation, and occlusion. An important step in this methodology, as in other visual recognition systems such as Convolutional Neural Networks, is spatial pooling, where the descriptors of neighbouring elements are combined into a local or global feature vector. The combined vector must retain relevant information while discarding irrelevant and confusing details. Maximum and average are the most common aggregation functions used in the pooling step. In this work we present a study of the cardinality of ordered average pooling, i.e. the number of ordered elements to aggregate so that the pooled vector keeps the relevant information without losing discriminative power for classification. Our extensive evaluation shows that, for several cardinality values, we obtain better results than simple average pooling and maximum pooling when dealing with small dictionary sizes.
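To make the notion of cardinality concrete, the following is a minimal sketch rather than the authors' implementation. It assumes that ordered average pooling sorts the activations of each codeword in descending order and averages the k largest, so that k = 1 reduces to maximum pooling and k equal to the number of descriptors reduces to average pooling. The function name `ordered_average_pool` and the layout of the `codes` array (one row per descriptor, one column per dictionary word) are hypothetical choices made for illustration.

```python
import numpy as np

def ordered_average_pool(codes, k):
    """Average the k largest activations of each codeword.

    codes: array of shape (n_descriptors, dictionary_size) (assumed layout).
    k:     cardinality, i.e. how many ordered elements are aggregated.
    """
    codes = np.asarray(codes, dtype=float)
    n_descriptors = codes.shape[0]
    if not 1 <= k <= n_descriptors:
        raise ValueError("k must be between 1 and the number of descriptors")
    # Sort each column in descending order and keep the first k rows.
    top_k = -np.sort(-codes, axis=0)[:k]
    return top_k.mean(axis=0)

# Three descriptors encoded over a two-word dictionary (made-up values).
codes = np.array([[0.1, 0.9],
                  [0.5, 0.2],
                  [0.3, 0.4]])

max_pooled = ordered_average_pool(codes, k=1)   # same as codes.max(axis=0)
avg_pooled = ordered_average_pool(codes, k=3)   # same as codes.mean(axis=0)
mid_pooled = ordered_average_pool(codes, k=2)   # between the two extremes
```

Under this assumption, varying k between the two extremes is what a study of cardinality would sweep over.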
