Structured sparse representation for human action recognition

Video understanding underlies several computer vision problems. It is typically approached by decomposing a video into a set of key components and modeling the interactions between them. Human action recognition is a challenging instance of video understanding. A common way to extract the components of an action video is to build a vocabulary of local image features in a bag of visual words (BoW) model. Because a video recognition task offers no direct mapping from raw features to class labels, higher-level visual descriptors and more accurate dictionaries are required. In this paper we therefore use group sparse coding to extract intrinsic shape bases and to account for the temporal structure of an action. In the proposed BoW method, each video is represented as a histogram of the coefficients obtained from group sparse coding. The main contribution of this study is to explore the geometry of action components through structured sparse coefficients of visual words, in real time. Compared with conventional BoW models, the proposed approach yields much lower quantization error and a higher-level feature representation, which reduces the number of model parameters and the memory complexity while still accounting for temporal structure. We evaluate the method on standard human action datasets: KTH, Weizmann, UCF-sports, and UCF50. The experimental results improve significantly on previously reported methods.
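The representation described in the abstract, a per-video histogram of group sparse coding coefficients, can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes the mixed l1/l2 formulation of group sparse coding, in which all descriptors of one video share a dictionary and a row-wise penalty switches whole visual words on or off, and it solves that problem with plain proximal gradient descent. The function names, the parameters `lam` and `n_iter`, the random dictionary, and the choice of row norms as histogram weights are all assumptions made for illustration.

```python
import numpy as np

def group_soft_threshold(A, tau):
    """Row-wise shrinkage: proximal operator of tau * sum_j ||A[j, :]||_2."""
    norms = np.linalg.norm(A, axis=1, keepdims=True)
    scale = np.maximum(0.0, 1.0 - tau / np.maximum(norms, 1e-12))
    return A * scale

def group_sparse_code(X, D, lam=0.1, n_iter=200):
    """Minimise 0.5 * ||X - D A||_F^2 + lam * sum_j ||A[j, :]||_2.

    X: (d, n) descriptors of one video (the group); D: (d, k) dictionary.
    Returns A: (k, n) coefficients, where row j belongs to visual word j.
    """
    # Step size 1/L, with L the squared spectral norm of D.
    step = 1.0 / (np.linalg.norm(D, 2) ** 2)
    A = np.zeros((D.shape[1], X.shape[1]))
    for _ in range(n_iter):
        grad = D.T @ (D @ A - X)  # gradient of the reconstruction term
        A = group_soft_threshold(A - step * grad, step * lam)
    return A

def video_histogram(A):
    """Pool coefficients into one weight per visual word, normalised to sum to 1."""
    weights = np.linalg.norm(A, axis=1)
    total = weights.sum()
    return weights / total if total > 0 else weights

# Hypothetical toy data: 8 unit-norm visual words in 16 dimensions,
# and 20 descriptors built from words 1 and 4 only.
rng = np.random.default_rng(0)
D = rng.standard_normal((16, 8))
D /= np.linalg.norm(D, axis=0)
X = D[:, [1, 4]] @ rng.standard_normal((2, 20))

A = group_sparse_code(X, D, lam=0.05)
hist = video_histogram(A)
```

Unlike hard assignment to the nearest visual word, each descriptor here is reconstructed from several words with real-valued coefficients, which is one way to read the abstract's claim of lower quantization error. The row-wise penalty also keeps the set of active words small per video, so the resulting histogram stays sparse.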
