Action Detection by Implicit Intentional Motion Clustering

Explicitly using human detection and pose estimation has found limited success in action recognition problems. This may be due to the complexity in the articulated motion human exhibit. Yet, we know that action requires an actor and intention. This paper hence seeks to understand the spatiotemporal properties of intentional movement and how to capture such intentional movement without relying on challenging human detection and tracking. We conduct a quantitative analysis of intentional movement, and our findings motivate a new approach for implicit intentional movement extraction that is based on spatiotemporal trajectory clustering by leveraging the properties of intentional movement. The intentional movement clusters are then used as action proposals for detection. Our results on three action detection benchmarks indicate the relevance of focusing on intentional movement for action detection, our method significantly outperforms the state of the art on the challenging MSR-II multi-action video benchmark.

[1]  Chenliang Xu,et al.  A Thousand Frames in Just a Few Words: Lingual Description of Videos through Latent Topics and Sparse Object Stitching , 2013, 2013 IEEE Conference on Computer Vision and Pattern Recognition.

[2]  G. G. Stokes "J." , 1890, The New Yale Book of Quotations.

[3]  David A. McAllester,et al.  Object Detection with Discriminatively Trained Part Based Models , 2010, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[4]  Thomas Serre,et al.  HMDB: A large video database for human motion recognition , 2011, 2011 International Conference on Computer Vision.

[5]  Benjamin Z. Yao,et al.  Learning deformable action templates from cluttered videos , 2009, 2009 IEEE 12th International Conference on Computer Vision.

[6]  Jitendra Malik,et al.  Object Segmentation by Long Term Analysis of Point Trajectories , 2010, ECCV.

[7]  Cordelia Schmid,et al.  Action recognition by dense trajectories , 2011, CVPR 2011.

[8]  Limin Wang,et al.  Video Action Detection with Relational Dynamic-Poselets , 2014, ECCV.

[9]  Mubarak Shah,et al.  Spatiotemporal Deformable Part Models for Action Detection , 2013, 2013 IEEE Conference on Computer Vision and Pattern Recognition.

[10]  Rémi Ronfard,et al.  A survey of vision-based methods for action representation, segmentation and recognition , 2011, Comput. Vis. Image Underst..

[11]  Larry S. Davis,et al.  AVSS 2011 demo session: A large-scale benchmark dataset for event recognition in surveillance video , 2011, AVSS.

[12]  Wei Chen,et al.  Actionness Ranking with Lattice Conditional Ordinal Random Fields , 2014, 2014 IEEE Conference on Computer Vision and Pattern Recognition.

[13]  Cordelia Schmid,et al.  Action Recognition with Improved Trajectories , 2013, 2013 IEEE International Conference on Computer Vision.

[14]  P. Cochat,et al.  Et al , 2008, Archives de pediatrie : organe officiel de la Societe francaise de pediatrie.

[15]  Richard P. Wildes,et al.  Efficient action spotting based on a spacetime oriented structure representation , 2010, 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition.

[16]  Yi Yang,et al.  Articulated pose estimation with flexible mixtures-of-parts , 2011, CVPR 2011.

[17]  Larry S. Davis,et al.  Multi-agent event recognition in structured scenarios , 2011, CVPR 2011.

[18]  Andrew W. Fitzgibbon,et al.  Real-time human pose recognition in parts from single depth images , 2011, CVPR 2011.

[19]  Alan L. Yuille,et al.  An Approach to Pose-Based Action Recognition , 2013, 2013 IEEE Conference on Computer Vision and Pattern Recognition.

[20]  Edward Y. Chang,et al.  Parallel Spectral Clustering in Distributed Systems , 2011, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[21]  Kate Saenko,et al.  Generating Natural-Language Video Descriptions Using Text-Mined Knowledge , 2013, AAAI.

[22]  Tieniu Tan,et al.  A survey on visual surveillance of object motion and behaviors , 2004, IEEE Transactions on Systems, Man, and Cybernetics, Part C (Applications and Reviews).

[23]  William Brendel,et al.  Activities as Time Series of Human Postures , 2010, ECCV.

[24]  James W. Davis,et al.  The Recognition of Human Movement Using Temporal Templates , 2001, IEEE Trans. Pattern Anal. Mach. Intell..

[25]  J.K. Aggarwal,et al.  Human activity analysis , 2011, ACM Comput. Surv..

[26]  Juan Carlos Niebles,et al.  Modeling Temporal Structure of Decomposable Motion Segments for Activity Classification , 2010, ECCV.

[27]  D. Davidson Actions, Reasons, And Causes , 1980 .

[28]  Mubarak Shah,et al.  Action MACH a spatio-temporal Maximum Average Correlation Height filter for action recognition , 2008, 2008 IEEE Conference on Computer Vision and Pattern Recognition.

[29]  Barbara Caputo,et al.  Recognizing human actions: a local SVM approach , 2004, Proceedings of the 17th International Conference on Pattern Recognition, 2004. ICPR 2004..

[30]  Patrick Bouthemy,et al.  Action Localization with Tubelets from Motion , 2014, 2014 IEEE Conference on Computer Vision and Pattern Recognition.

[31]  Yu Qiao,et al.  Action Recognition with Stacked Fisher Vectors , 2014, ECCV.

[32]  Yang Wang,et al.  Discriminative figure-centric models for joint action localization and recognition , 2011, 2011 International Conference on Computer Vision.

[33]  Iasonas Kokkinos,et al.  Discovering discriminative action parts from mid-level video representations , 2012, 2012 IEEE Conference on Computer Vision and Pattern Recognition.

[34]  Ronen Basri,et al.  Actions as Space-Time Shapes , 2007, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[35]  Cordelia Schmid,et al.  Towards Understanding Action Recognition , 2013, 2013 IEEE International Conference on Computer Vision.

[36]  Ran Xu,et al.  Combining Skeletal Pose with Local Motion for Human Activity Recognition , 2012, AMDO.

[37]  Zicheng Liu,et al.  Cross-dataset action detection , 2010, 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition.

[38]  Jason J. Corso,et al.  Action bank: A high-level representation of activity in video , 2012, 2012 IEEE Conference on Computer Vision and Pattern Recognition.

[39]  Greg Mori,et al.  Social roles in hierarchical models for human activity recognition , 2012, 2012 IEEE Conference on Computer Vision and Pattern Recognition.

[40]  Takeo Kanade,et al.  Background Subtraction for Freely Moving Cameras , 2009, 2009 IEEE 12th International Conference on Computer Vision.