Discovering discriminative action parts from mid-level video representations

We describe a mid-level approach for action recognition. From an input video, we extract salient spatio-temporal structures by forming clusters of trajectories that serve as candidates for the parts of an action. The assembly of these clusters into an action class is governed by a graphical model that incorporates appearance and motion constraints for the individual parts and pairwise constraints for the spatio-temporal dependencies among them. During training, we estimate the model parameters discriminatively. During classification, we efficiently match the model to a video using discrete optimization. We validate the model's classification ability on standard benchmark datasets and illustrate its potential to support a fine-grained analysis that not only labels a video, but also identifies and localizes its constituent parts.
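To make the matching step concrete, the following is a minimal sketch of scoring an assignment of model parts to candidate trajectory clusters with unary (per-part) and pairwise (between-part) terms. It is an illustration under stated assumptions, not the method of the paper: the function name `match_parts`, the score tables, and the temporal-order pairwise term are all hypothetical, and exhaustive search stands in for the discrete optimization mentioned above, so it is only practical for a handful of candidates.

```python
from itertools import permutations


def match_parts(unary, pairwise, edges):
    """Assign each model part to a distinct candidate cluster (hypothetical helper).

    unary[p][c]          -- score for placing part p on candidate cluster c
    pairwise(p, q, c, d) -- score for parts p, q placed on clusters c, d
    edges                -- list of (p, q) part pairs that share a pairwise term

    Exhaustive search is used here purely for illustration; it is a stand-in
    for a real discrete optimizer and scales factorially.
    """
    n_parts = len(unary)
    n_candidates = len(unary[0])
    best_score, best_assignment = float("-inf"), None
    for assignment in permutations(range(n_candidates), n_parts):
        # Sum the per-part scores for this assignment.
        score = sum(unary[p][c] for p, c in enumerate(assignment))
        # Add the scores for each connected pair of parts.
        score += sum(
            pairwise(p, q, assignment[p], assignment[q]) for p, q in edges
        )
        if score > best_score:
            best_score, best_assignment = score, assignment
    return best_score, best_assignment


# Toy data (assumed): two parts, three candidate clusters with start frames.
starts = [0, 5, 10]
unary = [
    [1.0, 0.2, 0.1],  # part 0 fits cluster 0 best
    [0.3, 0.9, 0.8],  # part 1 fits cluster 1 best
]


def temporal_order(p, q, c, d):
    # Assumed constraint: reward part p starting before part q.
    return 0.5 if starts[c] < starts[d] else -0.5


score, assignment = match_parts(unary, temporal_order, edges=[(0, 1)])
print(score, assignment)
```

The returned assignment both yields a score that could be compared across action classes and says which cluster plays which part, which is the sense in which matching also localizes the parts.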
