Multi-state Discriminative Video Segment Selection for Complex Event Classification

Recognizing long-range complex events composed of a sequence of primitive actions is a challenging task, as videos may not be consistently aligned in time with respect to the primitive actions across examples. Moreover, arbitrarily long intervals can occur between, and within, the executions of primitive actions. We propose a novel multi-state segment selection algorithm that pools features from the discriminative segments of a video. We present an efficient linear-programming-based solution and introduce novel linear constraints to enforce temporal ordering between segments from different states. We also propose a regularized version of our algorithm that automatically determines the optimal number of segments to select in each video. Furthermore, we present a new, provably faster \(O(N\log N)\) algorithm for the single-state K-segment selection problem. Our results are validated on the Composite Cooking Activity dataset, which contains videos of cooking recipes.
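To make the segment-selection idea concrete, the sketch below solves a simple version of the single-state K-segment selection problem: given a per-frame (or per-unit) discriminative score, choose K disjoint contiguous segments that maximize the total score. This is a minimal O(N·K) dynamic-programming illustration under assumed inputs; it is not the paper's O(N log N) algorithm, nor its multi-state linear program with temporal-ordering constraints, and the function name and `scores` array are hypothetical.

```python
def select_k_segments(scores, k):
    """Pick k disjoint contiguous segments maximizing total score.

    Illustrative O(N*k) dynamic-programming sketch of single-state
    K-segment selection. `scores[i]` is an assumed per-frame
    discriminative score (positive = supports the target class).
    Returns (best total score, list of (start, end) inclusive index pairs).
    """
    n = len(scores)
    NEG = float("-inf")
    # closed[j][i]: best total with j segments in the first i frames,
    #               with no segment ending exactly at frame i-1 still open.
    # opened[j][i]: best total with j segments, the j-th covering frame i-1.
    closed = [[NEG] * (n + 1) for _ in range(k + 1)]
    opened = [[NEG] * (n + 1) for _ in range(k + 1)]
    closed[0] = [0.0] * (n + 1)
    choice = {}  # backpointers: did the open segment "extend" or "start" here?
    for j in range(1, k + 1):
        for i in range(1, n + 1):
            s = scores[i - 1]
            ext = opened[j][i - 1] + s      # extend the current open segment
            new = closed[j - 1][i - 1] + s  # start a new segment at frame i-1
            opened[j][i] = max(ext, new)
            choice[(j, i)] = "extend" if ext >= new else "start"
            # keep the prefix as-is, or close the open segment at frame i-1
            closed[j][i] = max(closed[j][i - 1], opened[j][i])
    # Walk the backpointers to recover the selected segments.
    segments, j, i = [], k, n
    while j > 0 and i > 0:
        if closed[j][i] == closed[j][i - 1]:
            i -= 1
            continue
        end = i - 1
        while choice[(j, i)] == "extend":
            i -= 1
        segments.append((i - 1, end))
        i -= 1
        j -= 1
    segments.reverse()
    return closed[k][n], segments


# Example: two segments in a toy score sequence.
# select_k_segments([1.0, -2.0, 3.0], 2) -> (4.0, [(0, 0), (2, 2)])
```

In the multi-state setting described in the abstract, each state would have its own score profile, and additional constraints would force the segments chosen for state s to precede those chosen for state s+1; the paper formulates this as a linear program rather than the dynamic program shown here.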
