Learning spatiotemporal graphs of human activities

Complex human activities occurring in videos can be defined in terms of temporal configurations of primitive actions. Prior work typically hand-picks the primitives, their total number, and their temporal relations (e.g., allowing only followed-by), and then estimates only their relative significance for activity recognition. We advance prior work by learning what activity parts and spatiotemporal relations should be captured to represent an activity, and how relevant they are for enabling efficient inference in realistic videos. We represent videos by spatiotemporal graphs, where nodes correspond to multiscale video segments and edges capture their hierarchical, temporal, and spatial relationships. Video segments are provided by our new multiscale segmenter. Given a set of training spatiotemporal graphs, we learn their archetype graph and the probability density functions (pdfs) associated with model nodes and edges. The model adaptively learns the relevant video segments and their relations from data, addressing the “what” and the “how.” Inference and learning are formulated within the same framework, that of a robust least-squares optimization, which is invariant to arbitrary permutations of nodes in spatiotemporal graphs. The model is used to parse new videos by detecting and localizing relevant activity parts. We outperform the state of the art on the benchmark Olympic and UT human-interaction datasets, with a favorable complexity-vs.-accuracy trade-off.
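To make the notion of a permutation-invariant least-squares objective concrete, the following is a minimal sketch and not the method of the paper. It assumes each graph is summarized by a small matrix of edge weights, and it finds the node relabeling that minimizes the sum of squared differences by exhaustive search, which is feasible only for toy graphs. The names `matching_cost`, `best_match`, and `relabel` are hypothetical helpers introduced for illustration.

```python
from itertools import permutations


def relabel(graph, perm):
    """Return a copy of graph whose node i is the original node perm[i]."""
    n = len(graph)
    return [[graph[perm[i]][perm[j]] for j in range(n)] for i in range(n)]


def matching_cost(a, b, perm):
    """Sum of squared edge-weight differences after relabeling b by perm."""
    n = len(a)
    return sum(
        (a[i][j] - b[perm[i]][perm[j]]) ** 2
        for i in range(n)
        for j in range(n)
    )


def best_match(a, b):
    """Exhaustively search all node permutations for the least-squares match.

    Assumption: both graphs have the same number of nodes, and that number
    is small enough for brute force (the search visits n! permutations).
    """
    n = len(a)
    return min(
        ((matching_cost(a, b, p), p) for p in permutations(range(n))),
        key=lambda pair: pair[0],
    )


# Toy example: a 3-node path graph with distinct edge weights.
graph_a = [
    [0.0, 1.0, 0.0],
    [1.0, 0.0, 2.0],
    [0.0, 2.0, 0.0],
]
# The same graph with its nodes listed in a different order.
graph_b = relabel(graph_a, (2, 0, 1))

cost, perm = best_match(graph_a, graph_b)
print(cost, perm)
```

Because the minimum is taken over all relabelings, the resulting cost does not depend on the order in which either graph lists its nodes. That is the sense of invariance illustrated here; a practical system would replace the brute-force search with a relaxed or approximate optimization.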
