Approximate Fisher Kernels of Non-iid Image Models for Image Categorization

The bag-of-words (BoW) model treats images as sets of local descriptors and represents them by visual word histograms. The Fisher vector (FV) representation extends BoW, by considering the first and second order statistics of local descriptors. In both representations local descriptors are assumed to be identically and independently distributed (iid), which is a poor assumption from a modeling perspective. It has been experimentally observed that the performance of BoW and FV representations can be improved by employing discounting transformations such as power normalization. In this paper, we introduce non-iid models by treating the model parameters as latent variables which are integrated out, rendering all local regions dependent. Using the Fisher kernel principle we encode an image by the gradient of the data log-likelihood w.r.t. the model hyper-parameters. Our models naturally generate discounting effects in the representations; suggesting that such transformations have proven successful because they closely correspond to the representations obtained for non-iid models. To enable tractable computation, we rely on variational free-energy bounds to learn the hyper-parameters and to compute approximate Fisher kernels. Our experimental evaluation results validate that our models lead to performance improvements comparable to using power normalization, as employed in state-of-the-art feature aggregation methods.

[1]  Luc Van Gool,et al.  The Pascal Visual Object Classes Challenge: A Retrospective , 2014, International Journal of Computer Vision.

[2]  Cordelia Schmid,et al.  On the burstiness of visual elements , 2009, CVPR.

[3]  Geoffrey E. Hinton,et al.  ImageNet classification with deep convolutional neural networks , 2012, Commun. ACM.

[4]  Luc Van Gool,et al.  Modeling scenes with local descriptors and latent aspects , 2005, Tenth IEEE International Conference on Computer Vision (ICCV'05) Volume 1.

[5]  Trevor Darrell,et al.  Caffe: Convolutional Architecture for Fast Feature Embedding , 2014, ACM Multimedia.

[6]  Charles Elkan,et al.  Deriving TF-IDF as a Fisher Kernel , 2005, SPIRE.

[7]  Frédéric Jurie,et al.  Latent mixture vocabularies for object categorization and segmentation , 2009, Image Vis. Comput..

[8]  C. Schmid,et al.  On the burstiness of visual elements , 2009, 2009 IEEE Conference on Computer Vision and Pattern Recognition.

[9]  Antonio Criminisi,et al.  Object categorization by learned universal visual dictionary , 2005, Tenth IEEE International Conference on Computer Vision (ICCV'05) Volume 1.

[10]  Thomas Hofmann,et al.  Unsupervised Learning by Probabilistic Latent Semantic Analysis , 2004, Machine Learning.

[11]  Jean-Cédric Chappelier,et al.  PLSI: The True Fisher Kernel and beyond , 2009, ECML/PKDD.

[12]  Andrew Zisserman,et al.  Very Deep Convolutional Networks for Large-Scale Image Recognition , 2014, ICLR.

[13]  Svetlana Lazebnik,et al.  Multi-scale Orderless Pooling of Deep Convolutional Activation Features , 2014, ECCV.

[14]  Thomas Mensink,et al.  Image Classification with the Fisher Vector: Theory and Practice , 2013, International Journal of Computer Vision.

[15]  C. V. Jawahar,et al.  Blocks That Shout: Distinctive Parts for Scene Classification , 2013, 2013 IEEE Conference on Computer Vision and Pattern Recognition.

[16]  Xiang Zhang,et al.  OverFeat: Integrated Recognition, Localization and Detection using Convolutional Networks , 2013, ICLR.

[17]  Cor J. Veenman,et al.  Visual Word Ambiguity , 2010, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[18]  Lei Wang,et al.  Encoding High Dimensional Local Features by Sparse Coding Based Fisher Vectors , 2014, NIPS.

[19]  Stefan Carlsson,et al.  CNN Features Off-the-Shelf: An Astounding Baseline for Recognition , 2014, 2014 IEEE Conference on Computer Vision and Pattern Recognition Workshops.

[20]  David Kauchak,et al.  Modeling word burstiness using the Dirichlet distribution , 2005, ICML.

[21]  Andrew Zisserman,et al.  Efficient additive kernels via explicit feature maps , 2010, 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition.

[22]  Bingbing Ni,et al.  HCP: A Flexible CNN Framework for Multi-Label Image Classification , 2014, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[23]  ZissermanAndrew,et al.  The Pascal Visual Object Classes Challenge , 2015 .

[24]  Patrick Pérez,et al.  Feature Learning for the Image Retrieval Task , 2014, ACCV Workshops.

[25]  Thomas Hofmann,et al.  Learning the Similarity of Documents: An Information-Geometric Approach to Document Retrieval and Categorization , 1999, NIPS.

[26]  Radford M. Neal Pattern Recognition and Machine Learning , 2007, Technometrics.

[27]  Cordelia Schmid,et al.  Segmentation Driven Object Detection with Fisher Vectors , 2013, 2013 IEEE International Conference on Computer Vision.

[28]  Nebojsa Jojic,et al.  Capturing Spatial Interdependence in Image Features: The Counting Grid, an Epitomic Representation for Bags of Features , 2014, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[29]  Andrew Zisserman,et al.  The devil is in the details: an evaluation of recent feature encoding methods , 2011, BMVC.

[30]  T. Minka Estimating a Dirichlet distribution , 2012 .

[31]  Alexei A. Efros,et al.  Mid-level Visual Element Discovery as Discriminative Mode Seeking , 2013, NIPS.

[32]  Cordelia Schmid,et al.  Local Features and Kernels for Classification of Texture and Object Categories: A Comprehensive Study , 2006, 2006 Conference on Computer Vision and Pattern Recognition Workshop (CVPRW'06).

[33]  Cordelia Schmid,et al.  Beyond Bags of Features: Spatial Pyramid Matching for Recognizing Natural Scene Categories , 2006, 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'06).

[34]  Florent Perronnin,et al.  Fisher Kernels on Visual Vocabularies for Image Categorization , 2007, 2007 IEEE Conference on Computer Vision and Pattern Recognition.

[35]  Cordelia Schmid,et al.  Image categorization using Fisher kernels of non-iid image models , 2012, 2012 IEEE Conference on Computer Vision and Pattern Recognition.

[36]  Li Fei-Fei,et al.  ImageNet: A large-scale hierarchical image database , 2009, CVPR.

[37]  Michael I. Jordan,et al.  Latent Dirichlet Allocation , 2001, J. Mach. Learn. Res..

[38]  Cordelia Schmid,et al.  Aggregating Local Image Descriptors into Compact Codes , 2012, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[39]  Thomas Mensink,et al.  Improving the Fisher Kernel for Large-Scale Image Classification , 2010, ECCV.

[40]  Koen E. A. van de Sande,et al.  Selective Search for Object Recognition , 2013, International Journal of Computer Vision.

[41]  Florent Perronnin,et al.  Large-scale image categorization with explicit data embedding , 2010, 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition.

[42]  Michael I. Jordan,et al.  An Introduction to Variational Methods for Graphical Models , 1999, Machine Learning.

[43]  Gabriela Csurka,et al.  Visual categorization with bags of keypoints , 2002, eccv 2004.

[44]  David J. C. MacKay,et al.  Information Theory, Inference, and Learning Algorithms , 2004, IEEE Transactions on Information Theory.

[45]  Trevor Darrell,et al.  Rich Feature Hierarchies for Accurate Object Detection and Semantic Segmentation , 2013, 2014 IEEE Conference on Computer Vision and Pattern Recognition.

[46]  Nebojsa Jojic,et al.  Free energy score space , 2009, NIPS.

[47]  Antonio Torralba,et al.  Recognizing indoor scenes , 2009, 2009 IEEE Conference on Computer Vision and Pattern Recognition.

[48]  David Haussler,et al.  Exploiting Generative Models in Discriminative Classifiers , 1998, NIPS.

[49]  T. Tuytelaars,et al.  Weakly Supervised Object Detection with Posterior Regularization , 2014 .

[50]  Javier A. Redolfi,et al.  Exponential family Fisher vector for image classification , 2015, Pattern Recognit. Lett..

[51]  Andrew Zisserman,et al.  Return of the Devil in the Details: Delving Deep into Convolutional Nets , 2014, BMVC.

[52]  Peter Norvig,et al.  The Unreasonable Effectiveness of Data , 2009, IEEE Intelligent Systems.

[53]  Andrew Zisserman,et al.  Video Google: a text retrieval approach to object matching in videos , 2003, Proceedings Ninth IEEE International Conference on Computer Vision.

[54]  Takumi Kobayashi,et al.  Dirichlet-Based Histogram Feature Transform for Image Classification , 2014, 2014 IEEE Conference on Computer Vision and Pattern Recognition.

[55]  Alessandro Perina,et al.  Expression Microarray Data Classification Using Counting Grids and Fisher Kernel , 2014, 2014 22nd International Conference on Pattern Recognition.

[56]  Frédéric Jurie,et al.  Modeling spatial layout with fisher vectors for image categorization , 2011, 2011 International Conference on Computer Vision.

[57]  W. Eric L. Grimson,et al.  Spatial Latent Dirichlet Allocation , 2007, NIPS.

[58]  Jianguo Zhang,et al.  The PASCAL Visual Object Classes Challenge , 2006 .