Beyond Bag-of-Words: Fast video classification with Fisher Kernel Vector of Locally Aggregated Descriptors

In this paper we introduce a new video description framework that replaces the traditional Bag-of-Words with a combination of Fisher Kernels (FK) and Vector of Locally Aggregated Descriptors (VLAD). The main contributions are: (i) a fast algorithm that densely extracts global frame features, which are easier and faster to compute than spatio-temporal local features; (ii) a Random Forest approach that replaces the traditional k-means based vocabulary and yields a significant speedup; (iii) a modified VLAD and FK representation that replaces the classic Bag-of-Words and obtains better performance. We show that our framework is highly general and does not depend on a particular type of descriptor. It achieves state-of-the-art results in several classification scenarios.
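For readers unfamiliar with VLAD, the following is a minimal sketch of the standard encoding only, not the modified variant proposed in the paper. It assumes a vocabulary of centroids is already available (the paper builds its vocabulary with a Random Forest; here any array of centroids will do), assigns each descriptor to its nearest centroid, and accumulates the residuals. The function name `vlad_encode`, the power normalization, and the L2 normalization are illustrative choices taken from common VLAD practice rather than from this abstract.

```python
import numpy as np


def vlad_encode(descriptors, centroids):
    """Standard VLAD: sum residuals to the nearest centroid, then normalize.

    descriptors: (n, d) array of local or frame descriptors (assumed input).
    centroids:   (k, d) array of vocabulary centers (assumed precomputed).
    Returns a flat vector of length k * d.
    """
    k, d = centroids.shape
    # Squared Euclidean distance from every descriptor to every centroid: (n, k).
    dists = ((descriptors[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    assignments = dists.argmin(axis=1)

    vlad = np.zeros((k, d))
    for i in range(k):
        members = descriptors[assignments == i]
        if len(members) > 0:
            # Accumulate residuals (descriptor minus its centroid).
            vlad[i] = (members - centroids[i]).sum(axis=0)

    vlad = vlad.ravel()
    # Signed square-root (power) normalization, followed by L2 normalization.
    vlad = np.sign(vlad) * np.sqrt(np.abs(vlad))
    norm = np.linalg.norm(vlad)
    return vlad / norm if norm > 0 else vlad


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    frame_descriptors = rng.normal(size=(50, 8))   # toy data, not from the paper
    vocabulary = rng.normal(size=(4, 8))           # toy centroids
    encoding = vlad_encode(frame_descriptors, vocabulary)
    print(encoding.shape)
```

Unlike a Bag-of-Words histogram, which only counts how many descriptors fall in each vocabulary cell, this vector keeps the direction and magnitude of the offset from each centroid, which is why a much smaller vocabulary can suffice.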
