Learning a latent spatio-temporal compositional model for human action recognition

Action recognition is an important problem in multimedia understanding. This paper addresses it by building an expressive compositional action model. We model each action instance in a video as an ensemble of spatio-temporal compositions: a number of discrete temporal anchor frames, each of which is further decomposed into a layout of deformable parts. In this way, our model forms a Spatio-Temporal And-Or Graph (STAOG) that represents the latent structure of actions, \emph{e.g.} triple jumping, swinging, and high jumping. The STAOG model comprises four layers: (i) a batch of leaf-nodes at the bottom for detecting various action parts within video patches; (ii) or-nodes above them, i.e. switch variables that activate their child leaf-nodes to capture structural variability; (iii) and-nodes within each anchor frame for verifying spatial composition; and (iv) a root-node at the top for aggregating scores over the temporal anchor frames. Moreover, contextual interactions are defined between leaf-nodes in both the spatial and temporal domains. For model training, we develop a novel weakly supervised learning algorithm that iteratively determines the structural configuration (e.g. the production of leaf-nodes associated with the or-nodes) together with the optimization of the multi-layer parameters. By fully exploiting spatio-temporal compositions and interactions, our approach handles large intra-class action variance well (\emph{e.g.} different views, individual appearances, and spatio-temporal structures). Experimental results on challenging databases demonstrate the superior performance of our approach over other methods.
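The four-layer scoring hierarchy described above (leaf-nodes scored per video patch, or-nodes switching to their best child, and-nodes summing over an anchor frame's parts, and the root summing over anchor frames) can be sketched as a minimal data structure. This is a hypothetical illustration, not the authors' implementation: all class names, the scalar `weight` per leaf, and the raw-response inputs are assumptions, and the real model also includes deformation costs and contextual interaction terms that are omitted here.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class LeafNode:
    weight: float  # assumed scalar detector weight for one action part
    def score(self, response: float) -> float:
        # Leaf layer: scores one action part against a video-patch response.
        return self.weight * response

@dataclass
class OrNode:
    children: List[LeafNode]
    def score(self, responses: List[float]) -> float:
        # Or-node as a switch variable: activates the best-scoring child leaf.
        return max(c.score(r) for c, r in zip(self.children, responses))

@dataclass
class AndNode:
    or_nodes: List[OrNode]
    def score(self, frame_responses: List[List[float]]) -> float:
        # And-node: verifies the spatial composition within one anchor frame
        # by summing its part scores (deformation costs omitted in this sketch).
        return sum(o.score(r) for o, r in zip(self.or_nodes, frame_responses))

@dataclass
class STAOGRoot:
    anchors: List[AndNode]
    def score(self, video_responses) -> float:
        # Root: aggregates scores over the discrete temporal anchor frames.
        return sum(a.score(r) for a, r in zip(self.anchors, video_responses))
```

In this sketch the latent structural configuration is the set of argmax choices made inside each or-node; a weakly supervised learner would alternate between fixing those choices and updating the leaf weights.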
