Integrating local action elements for action analysis

In this paper, we propose a framework for human action analysis from video footage. A video action sequence in our perspective is a dynamic structure of sparse local spatial-temporal patches termed action elements, so the problems of action analysis in video are carried out here based on the set of local characteristics as well as global shape of a prescribed action. We first detect a set of action elements that are the most compact entities of an action, then we extend the idea of Implicit Shape Model to space time, in order to properly integrate the spatial and temporal properties of these action elements. In particular, we consider two different recipes to construct action elements: one is to use a Sparse Bayesian Feature Classifier to choose action elements from all detected Spatial Temporal Interest Points, and is termed discriminative action elements. The other one detects affine invariant local features from the holistic Motion History Images, and picks up action elements according to their compactness scores, and is called generative action elements. Action elements detected from either way are then used to construct a voting space based on their local feature representations as well as their global configuration constraints. Our approach is evaluated in the two main contexts of current human action analysis challenges, action retrieval and action classification. Comprehensive experimental results show that our proposed framework marginally outperforms all existing state-of-the-arts techniques on a range of different datasets.

[1]  G LoweDavid,et al.  Distinctive Image Features from Scale-Invariant Keypoints , 2004 .

[2]  Yang Wang,et al.  Learning a discriminative hidden part model for human action recognition , 2008, NIPS.

[3]  Vincent Lepetit,et al.  Fast Keypoint Recognition in Ten Lines of Code , 2007, 2007 IEEE Conference on Computer Vision and Pattern Recognition.

[4]  Franziska Meier,et al.  3D Shape Context and Distance Transform for action recognition , 2008, 2008 19th International Conference on Pattern Recognition.

[5]  Cordelia Schmid,et al.  Scale & Affine Invariant Interest Point Detectors , 2004, International Journal of Computer Vision.

[6]  R. Basri,et al.  Shape representation and classification using the Poisson equation , 2004, CVPR 2004.

[7]  Yannis Avrithis,et al.  Dense saliency-based spatiotemporal feature points for action recognition , 2009, CVPR.

[8]  Cordelia Schmid,et al.  Learning to Recognize Objects with Little Supervision , 2008, International Journal of Computer Vision.

[9]  R. Venkatesh Babu,et al.  Recognition of human actions using motion history information extracted from the compressed video , 2004, Image Vis. Comput..

[10]  Rémi Ronfard,et al.  Free viewpoint action recognition using motion history volumes , 2006, Comput. Vis. Image Underst..

[11]  Xinghua Sun,et al.  Action recognition via local descriptors and holistic features , 2009, 2009 IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops.

[12]  Patrick Pérez,et al.  Retrieving actions in movies , 2007, 2007 IEEE 11th International Conference on Computer Vision.

[13]  Li Wang,et al.  Implicit Motion-Shape Model: A generic approach for action matching , 2010, 2010 IEEE International Conference on Image Processing.

[14]  Barbara Caputo,et al.  Local velocity-adapted motion events for spatio-temporal recognition , 2007, Comput. Vis. Image Underst..

[15]  Alexei A. Efros,et al.  Discovering object categories in image collections , 2005 .

[16]  Andrew Zisserman,et al.  Video Google: a text retrieval approach to object matching in videos , 2003, Proceedings Ninth IEEE International Conference on Computer Vision.

[17]  Michael I. Jordan,et al.  On Discriminative vs. Generative Classifiers: A comparison of logistic regression and naive Bayes , 2001, NIPS.

[18]  Li Wang,et al.  Human Action Recognition and Localization in Video Using Structured Learning of Local Space-Time Features , 2010, 2010 7th IEEE International Conference on Advanced Video and Signal Based Surveillance.

[19]  Cordelia Schmid,et al.  A Semi-supervised Learning Approach to Object Recognition with Spatial Integration of Local Features and Segmentation Cues , 2006, Toward Category-Level Object Recognition.

[20]  Pietro Perona,et al.  A Probabilistic Cascade of Detectors for Individual Object Recognition , 2008, ECCV.

[21]  Yong Yu,et al.  Radon Representation-Based Feature Descriptor for Texture Classification , 2009, IEEE Transactions on Image Processing.

[22]  Jitendra Malik,et al.  Recognizing action at a distance , 2003, Proceedings Ninth IEEE International Conference on Computer Vision.

[23]  Shuicheng Yan,et al.  Contextual motion field-based distance for video analysis , 2008, The Visual Computer.

[24]  Roberto Cipolla,et al.  Semantic texton forests for image categorization and segmentation , 2008, 2008 IEEE Conference on Computer Vision and Pattern Recognition.

[25]  W. Eric L. Grimson,et al.  Gait analysis for recognition and classification , 2002, Proceedings of Fifth IEEE International Conference on Automatic Face Gesture Recognition.

[26]  Antonio Torralba,et al.  Describing Visual Scenes Using Transformed Objects and Parts , 2008, International Journal of Computer Vision.

[27]  Pietro Perona,et al.  A Bayesian hierarchical model for learning natural scene categories , 2005, 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'05).

[28]  Luc Van Gool,et al.  A Hough transform-based voting framework for action recognition , 2010, 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition.

[29]  Prateek Jain,et al.  Fast image search for learned metrics , 2008, 2008 IEEE Conference on Computer Vision and Pattern Recognition.

[30]  Richard Szeliski,et al.  Modeling the World from Internet Photo Collections , 2008, International Journal of Computer Vision.

[31]  Yang Wang,et al.  Human Action Recognition by Semilatent Topic Models , 2009, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[32]  Gabriela Csurka,et al.  Visual categorization with bags of keypoints , 2002, eccv 2004.

[33]  Jitendra Malik,et al.  SVM-KNN: Discriminative Nearest Neighbor Classification for Visual Category Recognition , 2006, 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'06).

[34]  Edmond Boyer,et al.  Action recognition using exemplar-based embedding , 2008, 2008 IEEE Conference on Computer Vision and Pattern Recognition.

[35]  Luc Van Gool,et al.  Action snippets: How many frames does human action recognition require? , 2008, 2008 IEEE Conference on Computer Vision and Pattern Recognition.

[36]  Pietro Perona,et al.  Object class recognition by unsupervised scale-invariant learning , 2003, 2003 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2003. Proceedings..

[37]  Cordelia Schmid,et al.  Evaluation of Local Spatio-temporal Features for Action Recognition , 2009, BMVC.

[38]  Mubarak Shah,et al.  Learning human actions via information maximization , 2008, 2008 IEEE Conference on Computer Vision and Pattern Recognition.

[39]  Jitendra Malik,et al.  Learning Globally-Consistent Local Distance Functions for Shape-Based Image Retrieval and Classification , 2007, 2007 IEEE 11th International Conference on Computer Vision.

[40]  Bill Triggs,et al.  Histograms of oriented gradients for human detection , 2005, 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'05).

[41]  Luc Van Gool,et al.  An Efficient Dense and Scale-Invariant Spatio-Temporal Interest Point Detector , 2008, ECCV.

[42]  Dan Roth,et al.  Learning a Sparse Representation for Object Detection , 2002, ECCV.

[43]  Xuelong Li,et al.  General Tensor Discriminant Analysis and Gabor Features for Gait Recognition , 2007, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[44]  Juan Carlos Niebles,et al.  Unsupervised Learning of Human Action Categories Using Spatial-Temporal Words , 2008, International Journal of Computer Vision.

[45]  Mubarak Shah,et al.  Action MACH a spatio-temporal Maximum Average Correlation Height filter for action recognition , 2008, 2008 IEEE Conference on Computer Vision and Pattern Recognition.

[46]  Krystian Mikolajczyk,et al.  Action recognition with motion-appearance vocabulary forest , 2008, 2008 IEEE Conference on Computer Vision and Pattern Recognition.

[47]  David A. McAllester,et al.  A discriminatively trained, multiscale, deformable part model , 2008, 2008 IEEE Conference on Computer Vision and Pattern Recognition.

[48]  Li Wang,et al.  Human Body Articulation for Action Recognition in Video Sequences , 2009, 2009 Sixth IEEE International Conference on Advanced Video and Signal Based Surveillance.

[49]  Alexei A. Efros,et al.  Discovering objects and their location in images , 2005, Tenth IEEE International Conference on Computer Vision (ICCV'05) Volume 1.

[50]  Ivan Laptev,et al.  On Space-Time Interest Points , 2005, International Journal of Computer Vision.

[51]  James W. Davis,et al.  The Recognition of Human Movement Using Temporal Templates , 2001, IEEE Trans. Pattern Anal. Mach. Intell..

[52]  Lior Wolf,et al.  Local Trinary Patterns for human action recognition , 2009, 2009 IEEE 12th International Conference on Computer Vision.

[53]  Bir Bhanu,et al.  Individual recognition using gait energy image , 2006, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[54]  Thomas Serre,et al.  A Biologically Inspired System for Action Recognition , 2007, 2007 IEEE 11th International Conference on Computer Vision.

[55]  Barbara Caputo,et al.  Recognizing human actions: a local SVM approach , 2004, ICPR 2004.

[56]  Li Wang,et al.  Weakly Supervised Action Recognition Using Implicit Shape Models , 2010, 2010 20th International Conference on Pattern Recognition.

[57]  Cordelia Schmid,et al.  Learning realistic human actions from movies , 2008, 2008 IEEE Conference on Computer Vision and Pattern Recognition.

[58]  Antonio Torralba,et al.  Small codes and large image databases for recognition , 2008, 2008 IEEE Conference on Computer Vision and Pattern Recognition.

[59]  Martial Hebert,et al.  Event Detection in Crowded Videos , 2007, 2007 IEEE 11th International Conference on Computer Vision.

[60]  Ze-Nian Li,et al.  Successive Convex Matching for Action Detection , 2006, 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'06).

[61]  Michael Isard,et al.  Total Recall: Automatic Query Expansion with a Generative Feature Model for Object Retrieval , 2007, 2007 IEEE 11th International Conference on Computer Vision.

[62]  Ronen Basri,et al.  Actions as space-time shapes , 2005, Tenth IEEE International Conference on Computer Vision (ICCV'05) Volume 1.

[63]  Serge J. Belongie,et al.  Behavior recognition via sparse spatio-temporal features , 2005, 2005 IEEE International Workshop on Visual Surveillance and Performance Evaluation of Tracking and Surveillance.

[64]  Bernt Schiele,et al.  Integrating representative and discriminant models for object category detection , 2005, Tenth IEEE International Conference on Computer Vision (ICCV'05) Volume 1.

[65]  Greg Mori,et al.  Action recognition by learning mid-level motion features , 2008, 2008 IEEE Conference on Computer Vision and Pattern Recognition.

[66]  B. Schiele,et al.  Combined Object Categorization and Segmentation With an Implicit Shape Model , 2004 .

[67]  Yang Wang,et al.  Human action recognition from a single clip per action , 2009, 2009 IEEE 12th International Conference on Computer Vision Workshops, ICCV Workshops.