Fisher vector with weakly-supervised Gaussian dictionary for scene classification

The Fisher Vector (FV) is a highly successful image representation that has achieved state-of-the-art performance on scene classification. It concatenates the gradients of the log-likelihood with respect to the parameters of a generative model to form the image representation, combining the advantages of generative and discriminative models. When a Gaussian mixture model (GMM) serves as the dictionary model, the FV can be regarded as an extension of Bag-of-Words (BoW). However, learning the dictionary with an unsupervised GMM discards the image labels, which carry much of the information needed for discrimination. To address this problem, we propose a novel strategy, the Weakly-Supervised Gaussian Dictionary for Fisher Vector (WSGD-FV), for building the image representation. Specifically, we first learn Gaussian words with a weakly-supervised method, then combine these words into a Gaussian dictionary that serves as the probability density function from which the FV is generated. Our method is shown to perform much better than the conventional FV on scene classification.
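The pipeline described above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the abstract does not specify how the Gaussian words are learned, so the hypothetical `class_gaussian_words` simply fits one diagonal Gaussian per class to local descriptors that inherit their image's label, and assigns uniform mixture weights. The hypothetical `fisher_vector` then uses the standard FV gradients with respect to the means and standard deviations of a diagonal GMM.

```python
import numpy as np

def class_gaussian_words(descriptors, labels):
    """Assumed stand-in for weakly-supervised word learning:
    one diagonal Gaussian per class, fitted to descriptors that
    inherit the label of the image they were extracted from."""
    classes = np.unique(labels)
    means = np.stack([descriptors[labels == c].mean(axis=0) for c in classes])
    variances = np.stack([descriptors[labels == c].var(axis=0) + 1e-6 for c in classes])
    # Combine the words into one dictionary (mixture) with uniform weights.
    weights = np.full(len(classes), 1.0 / len(classes))
    return weights, means, variances

def fisher_vector(x, weights, means, variances):
    """Fisher vector of descriptors x (T, D) under a diagonal GMM
    with K components: gradients w.r.t. means and standard deviations."""
    T = x.shape[0]
    # Whitened differences (x_t - mu_k) / sigma_k, shape (T, K, D).
    diff = (x[:, None, :] - means[None]) / np.sqrt(variances)[None]
    # Log of w_k * N(x_t | mu_k, sigma_k^2), shape (T, K).
    log_p = (-0.5 * (diff ** 2).sum(-1)
             - 0.5 * np.log(2 * np.pi * variances).sum(-1)[None]
             + np.log(weights)[None])
    # Posteriors gamma_t(k), computed stably.
    log_p -= log_p.max(axis=1, keepdims=True)
    gamma = np.exp(log_p)
    gamma /= gamma.sum(axis=1, keepdims=True)
    # Normalized gradients, each of shape (K, D).
    g_mu = (gamma[:, :, None] * diff).sum(0) / (T * np.sqrt(weights)[:, None])
    g_sigma = (gamma[:, :, None] * (diff ** 2 - 1)).sum(0) / (T * np.sqrt(2 * weights)[:, None])
    return np.concatenate([g_mu.ravel(), g_sigma.ravel()])
```

For K Gaussian words and D-dimensional descriptors the resulting vector has 2KD entries. In practice it would typically be power- and L2-normalized before being passed to a linear classifier.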
