Higher-Order Occurrence Pooling for Bags-of-Words: Visual Concept Detection

In object recognition, the Bag-of-Words model assumes: i) extraction of local descriptors from images, ii) embedding the descriptors by a coder to a given visual vocabulary space which results in mid-level features, iii) extracting statistics from mid-level features with a pooling operator that aggregates occurrences of visual words in images into signatures, which we refer to as First-order Occurrence Pooling. This paper investigates higher-order pooling that aggregates over co-occurrences of visual words. We derive Bag-of-Words with Higher-order Occurrence Pooling based on linearisation of Minor Polynomial Kernel, and extend this model to work with various pooling operators. This approach is then effectively used for fusion of various descriptor types. Moreover, we introduce Higher-order Occurrence Pooling performed directly on local image descriptors as well as a novel pooling operator that reduces the correlation in the image signatures. Finally, First-, Second-, and Third-order Occurrence Pooling are evaluated given various coders and pooling operators on several widely used benchmarks. The proposed methods are compared to other approaches such as Fisher Vector Encoding and demonstrate improved results.

[1]  Bolei Zhou,et al.  Learning Deep Features for Scene Recognition using Places Database , 2014, NIPS.

[2]  Thomas Mensink,et al.  Image Classification with the Fisher Vector: Theory and Practice , 2013, International Journal of Computer Vision.

[3]  Florent Perronnin,et al.  Modeling the spatial layout of images beyond spatial pyramids , 2012, Pattern Recognit. Lett..

[4]  Jian Sun,et al.  Spatial Pyramid Pooling in Deep Convolutional Networks for Visual Recognition , 2014, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[5]  Jean Ponce,et al.  A Theoretical Analysis of Feature Pooling in Visual Recognition , 2010, ICML.

[6]  K. Mikolajczyk,et al.  Higher-order Occurrence Pooling on Mid- and Low-level Features: Visual Concept Detection , 2013 .

[7]  Jean Ponce,et al.  Learning mid-level features for recognition , 2010, 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition.

[8]  Thomas S. Huang,et al.  Bilevel sparse coding for coupled feature spaces , 2012, 2012 IEEE Conference on Computer Vision and Pattern Recognition.

[9]  Josef Kittler,et al.  Novel Fusion Methods for Pattern Recognition , 2011, ECML/PKDD.

[10]  Cordelia Schmid,et al.  A performance evaluation of local descriptors , 2005, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[11]  Cristian Sminchisescu,et al.  Semantic Segmentation with Second-Order Pooling , 2012, ECCV.

[12]  Antonio Torralba,et al.  Recognizing indoor scenes , 2009, 2009 IEEE Conference on Computer Vision and Pattern Recognition.

[13]  Atsuto Maki,et al.  From generic to specific deep representations for visual recognition , 2014, 2015 IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW).

[14]  Guillermo Sapiro,et al.  Online Learning for Matrix Factorization and Sparse Coding , 2009, J. Mach. Learn. Res..

[15]  Hongping Cai,et al.  ℓp norm multiple kernel Fisher discriminant analysis for object and image categorisation , 2010, 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition.

[16]  Luc Van Gool,et al.  The 2005 PASCAL Visual Object Classes Challenge , 2005, MLCW.

[17]  David G. Lowe,et al.  Object recognition from local scale-invariant features , 1999, Proceedings of the Seventh IEEE International Conference on Computer Vision.

[18]  Stefanie Nowak,et al.  The CLEF 2011 Photo Annotation and Concept-based Retrieval Tasks , 2011, CLEF.

[19]  Motoaki Kawanabe,et al.  The Joint Submission of the TU Berlin and Fraunhofer FIRST (TUBFI) to the ImageCLEF2011 Photo Annotation Task , 2011, CLEF.

[20]  Florent Perronnin,et al.  Fisher Kernels on Visual Vocabularies for Image Categorization , 2007, 2007 IEEE Conference on Computer Vision and Pattern Recognition.

[21]  Josef Kittler,et al.  The University of Surrey Visual Concept Detection System at ImageCLEF@ICPR: Working Notes , 2010, ICPR.

[22]  Xinnan Yu,et al.  A 2D histogram representation of images for pooling , 2011, Electronic Imaging.

[23]  Cordelia Schmid,et al.  New Software and Platforms - Convolutional Kernel Networks , 2014 .

[24]  Krystian Mikolajczyk,et al.  On a Quest for Image Descriptors Based on Unsupervised Segmentation Maps , 2010, 2010 20th International Conference on Pattern Recognition.

[25]  Yihong Gong,et al.  Linear spatial pyramid matching using sparse coding for image classification , 2009, 2009 IEEE Conference on Computer Vision and Pattern Recognition.

[26]  Gabriela Csurka,et al.  Visual categorization with bags of keypoints , 2002, eccv 2004.

[27]  Krystian Mikolajczyk,et al.  Leveraging High Level Visual Information for Matching Images and Captions , 2014, ACCV.

[28]  Cor J. Veenman,et al.  Kernel Codebooks for Scene Categorization , 2008, ECCV.

[29]  Yihong Gong,et al.  Nonlinear Learning using Local Coordinate Coding , 2009, NIPS.

[30]  Nozha Boujemaa,et al.  Generalized histogram intersection kernel for image recognition , 2005, IEEE International Conference on Image Processing 2005.

[31]  Krista A. Ehinger,et al.  SUN Database: Exploring a Large Collection of Scene Categories , 2014, International Journal of Computer Vision.

[32]  Krystian Mikolajczyk,et al.  Soft assignment of visual words as Linear Coordinate Coding and optimisation of its reconstruction error , 2011, 2011 18th IEEE International Conference on Image Processing.

[33]  Andrew Zisserman,et al.  Automated Flower Classification over a Large Number of Classes , 2008, 2008 Sixth Indian Conference on Computer Vision, Graphics & Image Processing.

[34]  David Picard,et al.  Compact tensor based image representation for similarity search , 2012, 2012 19th IEEE International Conference on Image Processing.

[35]  Fei-Fei Li,et al.  Combining randomization and discrimination for fine-grained image categorization , 2011, CVPR 2011.

[36]  Michael Felsberg,et al.  Semantic Pyramids for Gender and Action Recognition , 2014, IEEE Transactions on Image Processing.

[37]  Krystian Mikolajczyk,et al.  Deep correlation for matching images and text , 2015, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[38]  Josef Kittler,et al.  Visual category recognition using Spectral Regression and Kernel Discriminant Analysis , 2009, 2009 IEEE 12th International Conference on Computer Vision Workshops, ICCV Workshops.

[39]  Krystian Mikolajczyk,et al.  Spatial Coordinate Coding to reduce histogram representations, Dominant Angle and Colour Pyramid Match , 2011, 2011 18th IEEE International Conference on Image Processing.

[40]  Jianguo Zhang,et al.  The PASCAL Visual Object Classes Challenge , 2006 .

[41]  Wen Gao,et al.  Group-Sensitive Multiple Kernel Learning for Object Recognition , 2012, IEEE Transactions on Image Processing.

[42]  Cordelia Schmid,et al.  Convolutional Kernel Networks , 2014, NIPS.

[43]  Josef Kittler,et al.  The University of Surrey Visual Concept Detection System at ImageCLEF@ICPR: Working Notes , 2010, 2010 20th International Conference on Pattern Recognition.

[44]  Jeff A. Bilmes,et al.  A gentle tutorial of the em algorithm and its application to parameter estimation for Gaussian mixture and hidden Markov models , 1998 .

[45]  Leonidas J. Guibas,et al.  Human action recognition by learning bases of action attributes and parts , 2011, 2011 International Conference on Computer Vision.

[46]  Andrew Zisserman,et al.  The devil is in the details: an evaluation of recent feature encoding methods , 2011, BMVC.

[47]  Rajat Raina,et al.  Efficient sparse coding algorithms , 2006, NIPS.

[48]  Cor J. Veenman,et al.  Visual Word Ambiguity , 2010, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[49]  Yihong Gong,et al.  Locality-constrained Linear Coding for image classification , 2010, 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition.

[50]  Tamara G. Kolda,et al.  Tensor Decompositions and Applications , 2009, SIAM Rev..

[51]  Cordelia Schmid,et al.  Aggregating local descriptors into a compact image representation , 2010, 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition.

[52]  Piotr. Koniusz,et al.  Novel image representations for visual categorisation with 'Bag-of-Words' , 2013 .

[53]  Michael Isard,et al.  Lost in quantization: Improving particular object retrieval in large scale image databases , 2008, 2008 IEEE Conference on Computer Vision and Pattern Recognition.

[54]  Cordelia Schmid,et al.  Beyond Bags of Features: Spatial Pyramid Matching for Recognizing Natural Scene Categories , 2006, 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'06).

[55]  Thomas Mensink,et al.  Improving the Fisher Kernel for Large-Scale Image Classification , 2010, ECCV.

[56]  Koen E. A. van de Sande,et al.  A comparison of color features for visual concept classification , 2008, CIVR '08.

[57]  Andrew Zisserman,et al.  Video Google: a text retrieval approach to object matching in videos , 2003, Proceedings Ninth IEEE International Conference on Computer Vision.

[58]  Joos Vandewalle,et al.  A Multilinear Singular Value Decomposition , 2000, SIAM J. Matrix Anal. Appl..

[59]  C. Schmid,et al.  On the burstiness of visual elements , 2009, 2009 IEEE Conference on Computer Vision and Pattern Recognition.

[60]  Pietro Perona,et al.  Learning Generative Visual Models from Few Training Examples: An Incremental Bayesian Approach Tested on 101 Object Categories , 2004, 2004 Conference on Computer Vision and Pattern Recognition Workshop.

[61]  Mark J. Huiskes,et al.  The MIR flickr retrieval evaluation , 2008, MIR '08.

[62]  Thomas S. Huang,et al.  Image Classification Using Super-Vector Coding of Local Image Descriptors , 2010, ECCV.

[63]  Liang-Tien Chia,et al.  Local features are not lonely – Laplacian sparse coding for image classification , 2010, 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition.

[64]  Geoffrey E. Hinton,et al.  ImageNet classification with deep convolutional neural networks , 2012, Commun. ACM.

[65]  Lei Wang,et al.  In defense of soft-assignment coding , 2011, 2011 International Conference on Computer Vision.

[66]  Krystian Mikolajczyk,et al.  Comparison of mid-level feature coding approaches and pooling strategies in visual concept detection , 2013, Comput. Vis. Image Underst..

[67]  Chong-Wah Ngo,et al.  Evaluating bag-of-visual-words representations in scene classification , 2007, MIR '07.