Compositional Structure Learning for Action Understanding

The action understanding literature has focused predominantly on classification; however, many applications, such as mobile robotics and video search, demand richer action understanding, with solutions to classification, localization, and detection. In this paper, we propose a compositional model that leverages a new mid-level representation called compositional trajectories and a locally articulated spatiotemporal deformable parts model (LASTDPM) for full action understanding. Our method is advantageous in capturing the variable structure of dynamic human activity over a long range. First, the compositional trajectories capture long-range, frequently co-occurring groups of trajectories in space-time and represent them in discriminative hierarchies, where human motion is largely separated from camera motion; second, LASTDPM learns a structured model with multi-layer deformable parts to capture multiple levels of articulated motion. We implement our methods and demonstrate state-of-the-art performance on all three problems: action detection, localization, and recognition.
