Action bank: A high-level representation of activity in video

Activity recognition in video is dominated by low- and mid-level features, and while demonstrably capable, by nature, these features carry little semantic meaning. Inspired by the recent object bank approach to image representation, we present Action Bank, a new high-level representation of video. Action bank is comprised of many individual action detectors sampled broadly in semantic space as well as viewpoint space. Our representation is constructed to be semantically rich and even when paired with simple linear SVM classifiers is capable of highly discriminative performance. We have tested action bank on four major activity recognition benchmarks. In all cases, our performance is better than the state of the art, namely 98.2% on KTH (better by 3.3%), 95.0% on UCF Sports (better by 3.7%), 57.9% on UCF50 (baseline is 47.9%), and 26.9% on HMDB51 (baseline is 23.2%). Furthermore, when we analyze the classifiers, we find strong transfer of semantics from the constituent action detectors to the bank classifier.

[1]  Adriana Kovashka,et al.  Learning a hierarchy of discriminative space-time neighborhood features for human action recognition , 2010, 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition.

[2]  Jitendra Malik,et al.  Recognizing action at a distance , 2003, Proceedings Ninth IEEE International Conference on Computer Vision.

[3]  Ivan Laptev,et al.  On Space-Time Interest Points , 2003, Proceedings Ninth IEEE International Conference on Computer Vision.

[4]  Richard P. Wildes,et al.  Efficient action spotting based on a spacetime oriented structure representation , 2010, 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition.

[5]  Jake K. Aggarwal,et al.  Spatio-temporal relationship match: Video structure comparison for recognition of complex human activities , 2009, 2009 IEEE 12th International Conference on Computer Vision.

[6]  Quoc V. Le,et al.  Learning hierarchical invariant spatio-temporal features for action recognition with independent subspace analysis , 2011, CVPR 2011.

[7]  S. Gong,et al.  Recognising action as clouds of space-time interest points , 2009, 2009 IEEE Conference on Computer Vision and Pattern Recognition.

[8]  Mubarak Shah,et al.  Chaotic Invariants for Human Action Recognition , 2007, 2007 IEEE 11th International Conference on Computer Vision.

[9]  Cordelia Schmid,et al.  Evaluation of Local Spatio-temporal Features for Action Recognition , 2009, BMVC.

[10]  Cordelia Schmid,et al.  A Spatio-Temporal Descriptor Based on 3D-Gradients , 2008, BMVC.

[11]  Konstantinos G. Derpanis,et al.  Three-dimensional nth derivative of Gaussian separable steerable filters , 2005, IEEE International Conference on Image Processing 2005.

[12]  Manik Varma,et al.  More generality in efficient multiple kernel learning , 2009, ICML '09.

[13]  Thomas Serre,et al.  HMDB: A large video database for human motion recognition , 2011, 2011 International Conference on Computer Vision.

[14]  Mubarak Shah,et al.  Learning human actions via information maximization , 2008, 2008 IEEE Conference on Computer Vision and Pattern Recognition.

[15]  Cordelia Schmid,et al.  Beyond Bags of Features: Spatial Pyramid Matching for Recognizing Natural Scene Categories , 2006, 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'06).

[16]  David A. Forsyth,et al.  Automatic Annotation of Everyday Movements , 2003, NIPS.

[17]  Alex Pentland,et al.  Coding, Analysis, Interpretation, and Recognition of Facial Expressions , 1997, IEEE Trans. Pattern Anal. Mach. Intell..

[18]  Rong Yan,et al.  Can High-Level Concepts Fill the Semantic Gap in Video Retrieval? A Case Study With Broadcast News , 2007, IEEE Transactions on Multimedia.

[19]  A. J. Mistlin,et al.  Visual analysis of body movements by neurones in the temporal cortex of the macaque monkey: A preliminary report , 1985, Behavioural Brain Research.

[20]  Hao Su,et al.  Object Bank: A High-Level Image Representation for Scene Classification & Semantic Feature Sparsification , 2010, NIPS.

[21]  Shaogang Gong,et al.  Recognising action as clouds of space-time interest points , 2009, CVPR.

[22]  Martin A. Giese,et al.  Neural model for the recognition of biological motion , 2010 .

[23]  G. Johansson Visual perception of biological motion and a model for its analysis , 1973 .

[24]  Andrew Gilbert,et al.  Fast realistic multi-action recognition using mined dense spatio-temporal features , 2009, 2009 IEEE 12th International Conference on Computer Vision.

[25]  Ronen Basri,et al.  Actions as Space-Time Shapes , 2007, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[26]  R. Nelson,et al.  Low level recognition of human motion (or how to get your man without finding his body parts) , 1994, Proceedings of 1994 IEEE Workshop on Motion of Non-rigid and Articulated Objects.

[27]  Mubarak Shah,et al.  Action MACH a spatio-temporal Maximum Average Correlation Height filter for action recognition , 2008, 2008 IEEE Conference on Computer Vision and Pattern Recognition.

[28]  David Elliott,et al.  In the Wild , 2010 .

[29]  Cordelia Schmid,et al.  Action recognition by dense trajectories , 2011, CVPR 2011.

[30]  Dong Xu,et al.  Action recognition using context and appearance distribution features , 2011, CVPR 2011.

[31]  Edward H. Adelson,et al.  The Design and Use of Steerable Filters , 1991, IEEE Trans. Pattern Anal. Mach. Intell..

[32]  Jiebo Luo,et al.  Recognizing realistic actions from videos “in the wild” , 2009, 2009 IEEE Conference on Computer Vision and Pattern Recognition.

[33]  Cordelia Schmid,et al.  Learning realistic human actions from movies , 2008, 2008 IEEE Conference on Computer Vision and Pattern Recognition.

[34]  James W. Davis,et al.  The Recognition of Human Movement Using Temporal Templates , 2001, IEEE Trans. Pattern Anal. Mach. Intell..

[35]  Lior Wolf,et al.  Local Trinary Patterns for human action recognition , 2009, 2009 IEEE 12th International Conference on Computer Vision.

[36]  Silvio Savarese,et al.  Recognizing human actions by attributes , 2011, CVPR 2011.

[37]  Barbara Caputo,et al.  Recognizing human actions: a local SVM approach , 2004, Proceedings of the 17th International Conference on Pattern Recognition, 2004. ICPR 2004..

[38]  Michael J. Black,et al.  Secrets of optical flow estimation and their principles , 2010, 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition.

[39]  Antonio Torralba,et al.  Modeling the Shape of the Scene: A Holistic Representation of the Spatial Envelope , 2001, International Journal of Computer Vision.

[40]  Juan Carlos Niebles,et al.  Spatial-Temporal correlatons for unsupervised action classification , 2008, 2008 IEEE Workshop on Motion and video Computing.