Human Action Recognition: Learning Sparse Basis Units from Trajectory Subspace

ABSTRACT Human action recognition from image sequences is a challenging problem in the computer vision community. Each action can be decomposed into a few intrinsic parts that concisely model the action, a decomposition motivated by the architecture of the visual cortex. To recognize human actions, we model a video as a sequence of visual words in which every change in the human pose corresponds to a “word.” This step is followed by sparse coding feature learning, which transforms low-level descriptors of visual words into richer representations of intermediate complexity, called mid-level features, similar to what happens in the primary visual cortex (area V1). In the next stage, using the extracted temporal information, a concise and informative topic is derived from phrase trajectories, exploiting the advantages of an effective locality-constrained linear coding (LLC) algorithm. The goal of this coding method is to represent input vectors as approximate linear combinations of a small number of “basis vectors,” which serve as high-level features; the locality constraints project each descriptor onto its local coordinate system. Recognizing an action then amounts to matching these active basis sequences within the larger video sequence. Although it is common to represent each video as a histogram of visual words, this type of representation discards the temporal information inherent to actions. To preserve the temporal order of actions, a hidden Markov model (HMM) is employed. The main contribution of this study is to extract mid- and high-level features while incorporating temporal ordering constraints between the basis units in real time. We evaluate our methodology on the KTH, Weizmann, and UCF-Sports human action datasets. Experimental results demonstrate that feature learning in the spatial and temporal domains improves final action recognition in terms of speed, accuracy, and interpretability compared with state-of-the-art methods.
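
To make the LLC step concrete, the following is a minimal sketch of the commonly used approximated LLC coding: each descriptor is expressed as a weighted combination of its k nearest codebook entries, with the weights constrained to sum to one. The function and parameter names (llc_code, B, k, beta) are illustrative assumptions, not identifiers from the paper, and the codebook learning and HMM stages are not shown here.

```python
import numpy as np

def llc_code(x, B, k=5, beta=1e-4):
    """Approximated LLC coding for one descriptor.

    x : (D,) input descriptor.
    B : (M, D) codebook of M basis vectors.
    Returns an (M,) code whose non-zero entries lie on the k nearest bases.
    """
    # 1. Select the k nearest codebook entries (the local coordinate system).
    dists = np.linalg.norm(B - x, axis=1)
    idx = np.argsort(dists)[:k]
    Bk = B[idx]                              # (k, D) local bases

    # 2. Solve the constrained least squares: min ||x - c^T Bk||^2, sum(c) = 1.
    z = Bk - x                               # shift local bases to the origin
    C = z @ z.T                              # (k, k) local covariance
    C += beta * np.trace(C) * np.eye(k)      # regularize for numerical stability
    w = np.linalg.solve(C, np.ones(k))
    w /= w.sum()                             # enforce the sum-to-one constraint

    code = np.zeros(B.shape[0])
    code[idx] = w
    return code
```

In this sketch, pooling the per-descriptor codes over a trajectory (e.g., by max pooling) would yield the high-level feature vector that is subsequently fed to the temporal model.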
