Pooling in image representation: The visual codeword point of view

In this work, we propose BossaNova, a novel representation for content-based concept detection in images and videos that enriches the Bag-of-Words model. The Bag-of-Words model relies on quantizing highly discriminant local descriptors with a codebook and aggregating the quantized descriptors into a single pooled feature vector; it has emerged as the most promising approach for concept detection in visual documents. BossaNova enhances that representation by keeping a histogram of distances between the descriptors found in the image and those in the codebook, thus preserving important information about the distribution of the local descriptors around each codeword. In contrast to other approaches found in the literature, this non-parametric histogram representation is compact and simple to compute. BossaNova compares well with the state of the art on several standard datasets (MIRFLICKR, ImageCLEF 2011, PASCAL VOC 2007, and 15-Scenes), even without complex combinations of different local descriptors. It also complements the cutting-edge Fisher Vector descriptors well, showing even better results when combined with them. BossaNova also shows good results in the challenging real-world application of pornography detection.
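The pooling idea described above, a histogram of descriptor-to-codeword distances kept for each codeword, can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `distance_histogram_pooling`, the Euclidean distance, the fixed range `[0, max_dist)`, the equal-width bins, and the absence of any weighting or normalization are all assumptions made for illustration only.

```python
import math


def distance_histogram_pooling(descriptors, codebook, num_bins=4, max_dist=2.0):
    """Pool local descriptors into one distance histogram per codeword.

    Hypothetical sketch: for every codeword, count how many descriptors
    fall into each of `num_bins` equal-width distance bins covering
    [0, max_dist). Descriptors farther than `max_dist` are ignored.
    The per-codeword histograms are concatenated into one feature vector.
    """
    bin_width = max_dist / num_bins
    feature = []
    for codeword in codebook:
        hist = [0] * num_bins
        for descriptor in descriptors:
            dist = math.dist(descriptor, codeword)  # Euclidean distance (assumed)
            if dist < max_dist:
                # Clamp guards against floating-point rounding at the upper edge.
                bin_index = min(int(dist / bin_width), num_bins - 1)
                hist[bin_index] += 1
        feature.extend(hist)
    return feature


# Toy data: three 2-D "descriptors" and a two-word codebook.
descriptors = [(0.0, 0.0), (0.0, 0.3), (1.0, 1.0)]
codebook = [(0.0, 0.0), (1.0, 1.0)]
feature = distance_histogram_pooling(descriptors, codebook, num_bins=2, max_dist=2.0)
print(feature)  # [2, 1, 1, 2]
```

With a single bin per codeword this reduces to counting the descriptors within `max_dist` of each codeword, which is close in spirit to a plain Bag-of-Words count; the extra bins are what retain information about how descriptors are spread around each codeword.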
