Event Fisher Vectors: Robust Encoding Visual Diversity of Visual Streams

In this paper we focus on event recognition in visual image streams. More specifically, we aim to construct a compact representation which encodes the diversity of the visual stream from just a few observations. For this purpose, we introduce the Event Fisher Vector, a Fisher Kernel based representation to describe a collection of images or the sequential frames of a video. We explore different generative models beyond the Gaussian mixture model as underlying probability distribution. First, the Student’s-t mixture model which captures the heavy tails of the small sample size of a collection of images. Second, Hidden Markov Models to explicitly capture the temporal ordering of the observations in a stream. For all our models we derive analytical approximations of the Fisher information matrix, which significantly improves recognition performance. We extensively evaluate the properties of our proposed method on three recent datasets for event recognition in photo collections and web videos, leading to an efficient compact image representation which achieves state-of-the-art performance on all these datasets.

[1]  David Haussler,et al.  Exploiting Generative Models in Discriminative Classifiers , 1998, NIPS.

[2]  Geoffrey J. McLachlan,et al.  Robust mixture modelling using the t distribution , 2000, Stat. Comput..

[3]  Cordelia Schmid,et al.  Beyond Bags of Features: Spatial Pyramid Matching for Recognizing Natural Scene Categories , 2006, 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'06).

[4]  Florent Perronnin,et al.  Fisher Kernels on Visual Vocabularies for Image Categorization , 2007, 2007 IEEE Conference on Computer Vision and Pattern Recognition.

[5]  Won Jong Jeon,et al.  Spatio-temporal pyramid matching for sports videos , 2008, MIR '08.

[6]  Jiebo Luo,et al.  Event recognition from photo collections via PageRank , 2009, MM '09.

[7]  Fei-Fei Li,et al.  ImageNet: A large-scale hierarchical image database , 2009, 2009 IEEE Conference on Computer Vision and Pattern Recognition.

[8]  Shih-Fu Chang,et al.  Consumer video understanding: a benchmark database and an evaluation of human and machine performance , 2011, ICMR.

[9]  Nicu Sebe,et al.  Exploitation of time constraints for (sub-)event recognition , 2011, J-MRE '11.

[10]  Matthijs C. Dorst Distinctive Image Features from Scale-Invariant Keypoints , 2011 .

[11]  Florent Perronnin,et al.  Modeling the spatial layout of images beyond spatial pyramids , 2012, Pattern Recognit. Lett..

[12]  Cordelia Schmid,et al.  Image categorization using Fisher kernels of non-iid image models , 2012, 2012 IEEE Conference on Computer Vision and Pattern Recognition.

[13]  Dong Liu,et al.  Robust late fusion with rank minimization , 2012, 2012 IEEE Conference on Computer Vision and Pattern Recognition.

[14]  Hans-Peter Kriegel,et al.  A survey on unsupervised outlier detection in high‐dimensional numerical data , 2012, Stat. Anal. Data Min..

[15]  Cordelia Schmid,et al.  Action and Event Recognition with Fisher Vectors on a Compact Feature Set , 2013, 2013 IEEE International Conference on Computer Vision.

[16]  Nicu Sebe,et al.  Time matters!: capturing variation in time in video using fisher kernels , 2013, MM '13.

[17]  Andrew Zisserman,et al.  Deep Fisher Networks for Large-Scale Image Classification , 2013, NIPS.

[18]  Dong Liu,et al.  Sample-Specific Late Fusion for Visual Category Recognition , 2013, 2013 IEEE Conference on Computer Vision and Pattern Recognition.

[19]  Thomas Mensink,et al.  Image Classification with the Fisher Vector: Theory and Practice , 2013, International Journal of Computer Vision.

[20]  Cordelia Schmid,et al.  Action Recognition with Improved Trajectories , 2013, 2013 IEEE International Conference on Computer Vision.

[21]  Matthieu Guillaumin,et al.  Event Recognition in Photo Collections with a Stopwatch HMM , 2013, 2013 IEEE International Conference on Computer Vision.

[22]  Ramakant Nevatia,et al.  Large-scale web video event classification by use of Fisher Vectors , 2013, 2013 IEEE Workshop on Applications of Computer Vision (WACV).

[23]  Rob Fergus,et al.  Visualizing and Understanding Convolutional Networks , 2013, ECCV.

[24]  Yu Qiao,et al.  Action Recognition with Stacked Fisher Vectors , 2014, ECCV.

[25]  Cees Snoek,et al.  VideoStory: A New Multimedia Embedding for Few-Example Recognition and Translation of Events , 2014, ACM Multimedia.

[26]  Pong C. Yuen,et al.  Reduced Analytic Dependency Modeling: Robust Fusion for Visual Recognition , 2014, International Journal of Computer Vision.

[27]  Ming-Syan Chen,et al.  Video Event Detection by Inferring Temporal Instance Labels , 2014, 2014 IEEE Conference on Computer Vision and Pattern Recognition.

[28]  Jun Wang,et al.  Exploring Inter-feature and Inter-class Relationships with Deep Neural Networks for Video Classification , 2014, ACM Multimedia.