Paper information - Modeling video evolution for action recognition

Modeling video evolution for action recognition

In this paper, we present a method to capture video-wide temporal information for action recognition. We postulate that a function capable of ordering the frames of a video temporally (based on their appearance) captures the evolution of appearance within the video well. We learn such ranking functions per video via a ranking machine and use their parameters as a new video representation. The proposed method is easy to interpret and implement, fast to compute, and effective in recognizing a wide variety of actions. We perform a large number of evaluations on datasets for generic action recognition (Hollywood2 and HMDB51), fine-grained actions (MPII cooking activities), and gestures (ChaLearn). Results show that the proposed method brings an absolute improvement of 7-10%, while being compatible with and complementary to further improvements in appearance-based and local-motion-based methods.
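The core idea of the abstract, fitting a per-video function that orders frames in time and using its parameters as the video descriptor, can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes a linear scoring function, swaps the ranking machine for a plain ridge regression of frame index onto features, and smooths frames with a running mean. The names `running_mean`, `rank_pool`, and the parameter `reg` are hypothetical and introduced only for illustration.

```python
import numpy as np


def running_mean(frames):
    """Mean of frames 1..t for every t (smoothing step; an assumption of this sketch)."""
    frames = np.asarray(frames, dtype=float)
    counts = np.arange(1, frames.shape[0] + 1)[:, None]
    return np.cumsum(frames, axis=0) / counts


def rank_pool(frames, reg=1.0):
    """Fit a linear function whose score grows with frame index.

    frames: (T, D) array of per-frame appearance features.
    Returns the (D,) weight vector, used here as the video representation.
    Ridge regression stands in for the ranking machine (assumption).
    """
    smoothed = running_mean(frames)
    n_frames, n_dims = smoothed.shape
    # Centre features and targets so that no bias term is needed.
    X = smoothed - smoothed.mean(axis=0)
    t = np.arange(1, n_frames + 1, dtype=float)
    y = t - t.mean()
    # Closed-form ridge solution: (X^T X + reg * I) w = X^T y
    return np.linalg.solve(X.T @ X + reg * np.eye(n_dims), X.T @ y)


if __name__ == "__main__":
    # Toy "video": features drift along one direction as time advances.
    direction = np.array([1.0, -2.0, 0.5])
    video = np.outer(np.arange(1, 11), direction)
    w = rank_pool(video)
    scores = running_mean(video) @ w
    print("scores increase with time:", bool(np.all(np.diff(scores) > 0)))
```

In this sketch, each video yields one fixed-length vector `w` regardless of its number of frames, so the vectors from many videos could be passed to any standard classifier. Playing the same toy video backwards flips the sign of the fitted direction, which is the sense in which the parameters encode temporal order.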
