Multi-level action detection via learning latent structure

Detecting actions in videos remains a demanding task due to the large intra-class variation caused by changes in pose, motion, and scale. Conventional approaches use a Bag-of-Words model, pooling space-time motion features and then learning a classifier. However, because informative body-part motions appear only in specific regions of the body, such methods have limited capability. In this paper, we seek to learn a model of the interactions among regions of interest via a graph structure. We first discover several space-time video segments representing persistently moving body parts observed sparsely in the video. Then, by learning the hidden graph structure (a subset of the graph), we identify both spatial and temporal relations between subsets of these segments. To capture more discriminative motion patterns and to handle the different interactions between body parts, from simple to composite actions, we present a multi-level action model representation. For action classification, the classifier learned for each action model scores the test video, which is then labeled according to the model that yields the highest probability score. Experiments on challenging datasets such as MSR II and UCF-Sports, which include complex motions and dynamic backgrounds, demonstrate the effectiveness of the proposed approach, which outperforms state-of-the-art methods in this context.
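The classification rule described above can be sketched as follows. This is a minimal illustrative sketch, not the paper's actual implementation: the scoring functions and model names are hypothetical stand-ins for the classifiers learned per action model, and a real system would score structured graph features rather than a flat feature vector.

```python
def classify(video_features, action_models):
    """Label a test video with the action model giving the highest score.

    action_models maps an action label to a learned scoring function
    (hypothetical stand-in for the per-model classifier in the paper).
    """
    best_label, best_score = None, float("-inf")
    for label, score_fn in action_models.items():
        score = score_fn(video_features)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Toy usage with stand-in linear scorers (illustrative only).
models = {
    "boxing": lambda x: 0.2 * x[0] + 0.8 * x[1],
    "waving": lambda x: 0.9 * x[0] - 0.1 * x[1],
}
print(classify([1.0, 0.5], models))  # the higher-scoring model's label wins
```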
