论文信息 - Beyond Bag-of-Words: Fast video classification with Fisher Kernel Vector of Locally Aggregated Descriptors

Beyond Bag-of-Words: Fast video classification with Fisher Kernel Vector of Locally Aggregated Descriptors

In this paper we introduce a new video description framework that replaces traditional Bag-of-Words with a combination of Fisher Kernels (FK) and Vector of Locally Aggregated Descriptors (VLAD). The main contributions are: (i) a fast algorithm to densely extract global frame features, easier and faster to compute than spatio-temporal local features; (ii) replacing the traditional k-means based vocabulary with a Random Forest approach that allows significant speedup; (iii) use of a modified VLAD and FK representation to replace the classic Bag-of-Words and obtaining better performance. We show that our framework is highly general and is not dependent on a particular type of descriptor. It achieves state-of-the-art results in several classification scenarios.

[1] A. Smeaton,et al. TRECVID 2013 -- An Overview of the Goals, Tasks, Data, Evaluation Mechanisms, and Metrics | NIST , 2011 .

[2] Mohammad Soleymani,et al. VSD, a public dataset for the detection of violent scenes in movies: design, annotation, analysis and evaluation , 2014, Multimedia Tools and Applications.

[3] Georges Quénot,et al. TRECVID 2015 - An Overview of the Goals, Tasks, Data, Evaluation Mechanisms and Metrics , 2011, TRECVID.

[4] Cordelia Schmid,et al. Dense Trajectories and Motion Boundary Descriptors for Action Recognition , 2013, International Journal of Computer Vision.

[5] Mubarak Shah,et al. UCF101: A Dataset of 101 Human Actions Classes From Videos in The Wild , 2012, ArXiv.

[6] Cordelia Schmid,et al. Action Recognition with Improved Trajectories , 2013, 2013 IEEE International Conference on Computer Vision.

[7] Rainer Stiefelhagen,et al. KIT at MediaEval 2012 - Content - based Genre Classification with Visual Cues , 2012, MediaEval.

[8] Roland Göcke,et al. Ordered Trajectories for Large Scale Human Action Recognition , 2013, 2013 IEEE International Conference on Computer Vision Workshops.

[9] Thomas Sikora,et al. TUB @ MediaEval 2012 Tagging Task: Feature Selection Methods for Bag-of-(visual)-Words Approaches , 2012, MediaEval.

[10] Patrick Bouthemy,et al. Better Exploiting Motion for Better Action Recognition , 2013, 2013 IEEE Conference on Computer Vision and Pattern Recognition.

[11] Cordelia Schmid,et al. Aggregating Local Image Descriptors into Compact Codes , 2012, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[12] Cordelia Schmid,et al. Learning Color Names for Real-World Applications , 2009, IEEE Transactions on Image Processing.

[13] Nicu Sebe,et al. Time matters!: capturing variation in time in video using fisher kernels , 2013, MM '13.

[14] Nicu Sebe,et al. Video classification with Densely extracted HOG/HOF/MBH features: an evaluation of the accuracy/computational efficiency trade-off , 2015, International Journal of Multimedia Information Retrieval.

[15] Urbano Nunes,et al. Trainable classifier-fusion schemes: An application to pedestrian detection , 2009, 2009 12th International IEEE Conference on Intelligent Transportation Systems.

[16] Horia Cucu,et al. ARF @ MediaEval 2012: Multimodal Video Classification , 2012, MediaEval.

[17] Yi Yang,et al. Articulated Human Detection with Flexible Mixtures of Parts , 2013, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[18] Thomas Fillon,et al. YAAFE, an Easy to Use and Efficient Audio Feature Extraction Software , 2010, ISMIR.

[19] Fei-Fei Li,et al. Large-Scale Video Classification with Convolutional Neural Networks , 2014, 2014 IEEE Conference on Computer Vision and Pattern Recognition.

[20] Markus Schedl,et al. FAR at MediaEval 2013 Violent Scenes Detection: Concept-based Violent Scenes Detection in Movies , 2013, MediaEval.

[21] Markus Schedl,et al. The MediaEval 2013 Affect Task: Violent Scenes Detection , 2013, MediaEval.

[22] Cordelia Schmid,et al. Aggregating local descriptors into a compact image representation , 2010, 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition.

[23] Takeo Kanade,et al. An Iterative Image Registration Technique with an Application to Stereo Vision , 1981, IJCAI.

[24] Patrick Gros,et al. Technicolor/INRIA Team at the MediaEval 2013 Violent Scenes Detection Task , 2013, MediaEval.

[25] Limin Wang,et al. Bag of visual words and fusion methods for action recognition: Comprehensive study and good practice , 2014, Comput. Vis. Image Underst..

[26] Martha Larson,et al. Blip10000: a social video dataset containing SPUG content for tagging and retrieval , 2013, MMSys.

[27] Thomas Mensink,et al. Improving the Fisher Kernel for Large-Scale Image Classification , 2010, ECCV.

[28] Luca Maria Gambardella,et al. Proceedings of the Twenty-Second International Joint Conference on Artificial Intelligence Flexible, High Performance Convolutional Neural Networks for Image Classification , 2022 .

[29] Leo Breiman,et al. Random Forests , 2001, Machine Learning.

[30] Imre Csiszár,et al. Information Theory - Coding Theorems for Discrete Memoryless Systems, Second Edition , 2011 .

[31] Terumasa Aoki,et al. TUDCL at MediaEval 2013 Violent Scenes Detection: Training with Multi-modal Features by MKL , 2013, MediaEval.

[32] Andrew Zisserman,et al. Two-Stream Convolutional Networks for Action Recognition in Videos , 2014, NIPS.