MoFAP: A Multi-level Representation for Action Recognition

This paper proposes a multi-level video representation built by stacking the activations of motion features, atoms, and phrases (MoFAP). Motion features are low-level local descriptors, while motion atoms and phrases can be viewed as mid-level "temporal parts". A motion atom is defined as an atomic part of an action, capturing the motion information of a video over a short temporal scale. A motion phrase is a temporal composite of multiple motion atoms defined with an AND/OR structure; it further enhances the discriminative capacity of motion atoms by incorporating temporal structure over a longer temporal scale. Specifically, we first design a discriminative clustering method to automatically discover a set of representative motion atoms. We then mine effective motion phrases with high discriminative and representative capacity in a bottom-up manner. Based on these basic units of motion features, atoms, and phrases, we construct a MoFAP network by stacking them layer by layer. This network enables us to extract effective representations of video data at different levels and scales. The separate representations from motion features, motion atoms, and motion phrases are concatenated into a single representation, called the Activation of MoFAP. The effectiveness of this representation is demonstrated on four challenging datasets: Olympic Sports, UCF50, HMDB51, and UCF101. Experimental results show that our representation achieves state-of-the-art performance on these datasets.
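The final step described above, concatenating the per-level representations into a single descriptor, can be sketched as follows. This is a minimal illustration only: the function and variable names are hypothetical, and the paper's actual per-level encodings (e.g. Fisher-vector encoding of the low-level features) are more involved than the raw vectors assumed here. L2-normalizing each level before concatenation is a common practice so that no single level dominates the combined descriptor.

```python
import numpy as np

def l2_normalize(v, eps=1e-12):
    # Scale a vector to unit L2 norm; eps guards against division by zero.
    return v / (np.linalg.norm(v) + eps)

def activation_of_mofap(feature_act, atom_act, phrase_act):
    """Concatenate per-level activations into one video descriptor.

    Hypothetical sketch: each argument stands for the activation vector
    produced at one level of the MoFAP network (features, atoms, phrases).
    """
    parts = [l2_normalize(np.asarray(a, dtype=float).ravel())
             for a in (feature_act, atom_act, phrase_act)]
    return np.concatenate(parts)

# Toy example: three levels with different (made-up) dimensionalities.
rng = np.random.default_rng(0)
desc = activation_of_mofap(rng.normal(size=256),   # low-level feature encoding
                           rng.normal(size=64),    # motion-atom responses
                           rng.normal(size=32))    # motion-phrase responses
print(desc.shape)  # (352,)
```

The resulting vector would then be fed to a standard linear classifier (the paper's experiments use SVMs) for action recognition.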
