Boosting VLAD with double assignment using deep features for action recognition in videos

The encoding method is an important factor in an action recognition pipeline, and a key aspect of any encoding method is the assignment step. A widely used super-vector encoding method is the vector of locally aggregated descriptors (VLAD), which achieves very competitive results in many tasks. However, VLAD uses only hard assignment, and the assignment criterion considers only the features' perspective: each feature votes for its nearest visual word. In this work we propose to encode deep features for videos using a double-assignment VLAD (DA-VLAD). In addition to the traditional VLAD assignment, we perform a second assignment that takes the codebook's perspective into account: we ask which features are nearest to each visual word, rather than only which centroid is nearest to each feature, as in the standard assignment. Another important factor for the performance of an action recognition system is the feature extraction step. Recently, deep features have obtained state-of-the-art results in many tasks and have also been adopted for action recognition, with competitive results over hand-crafted features. This work includes a pipeline to extract local deep features for videos using any available network as a black box, and we show competitive results even when the network was trained for another task or another dataset. Our DA-VLAD encoding method outperforms traditional VLAD, and we obtain state-of-the-art results on the UCF50 dataset and competitive results on the UCF101 dataset.
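To make the double-assignment idea concrete, the sketch below encodes a set of local descriptors with both the standard feature-side assignment and a second, codebook-side assignment. It is a minimal illustration under stated assumptions: the exact second-assignment rule (here, taking a fixed number `n_per_centroid` of nearest features per visual word) and the normalization choices are assumptions for illustration, not the authors' exact formulation.

```python
import numpy as np

def da_vlad(descriptors, codebook, n_per_centroid=5):
    """Sketch of double-assignment VLAD (DA-VLAD).

    descriptors: (N, D) local deep features extracted from one video
    codebook:    (K, D) visual words (e.g. k-means centroids)
    Returns a (K * D,) normalized super-vector.
    """
    N, D = descriptors.shape
    K, _ = codebook.shape
    enc = np.zeros((K, D))

    # Pairwise distances between descriptors and centroids: (N, K)
    dists = np.linalg.norm(descriptors[:, None, :] - codebook[None, :, :], axis=2)

    # 1) Standard VLAD assignment: each feature votes for its nearest visual word
    #    and contributes its residual to that word's block.
    nearest_word = np.argmin(dists, axis=1)
    for i in range(N):
        k = nearest_word[i]
        enc[k] += descriptors[i] - codebook[k]

    # 2) Second assignment from the codebook side: each visual word also
    #    aggregates the residuals of its n nearest features
    #    (assumed rule, for illustration only).
    for k in range(K):
        nearest_feats = np.argsort(dists[:, k])[:n_per_centroid]
        for i in nearest_feats:
            enc[k] += descriptors[i] - codebook[k]

    # Power- and L2-normalization, as commonly applied to VLAD vectors.
    enc = enc.flatten()
    enc = np.sign(enc) * np.sqrt(np.abs(enc))
    enc /= (np.linalg.norm(enc) + 1e-12)
    return enc
```

The first loop is the usual hard assignment (feature to nearest centroid); the second loop reverses the direction of the query (centroid to nearest features), so features that lie close to a word can contribute to it even when another word is marginally closer, which is the intuition behind the double assignment.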
