Action recognition via structured codebook construction

Bag-of-words models have been widely used to obtain a global representation for action recognition. However, these models ignore structural information, such as the spatial and temporal context of local features. In this paper, we propose a novel structured codebook construction method that encodes spatial and temporal contextual information among local features for video representation. Given a set of training videos, our method first extracts local motion and appearance features. Next, it encodes the spatial and temporal contextual information among local features by constructing correlation matrices for local spatio-temporal features. Then, it discovers the common patterns of movement to construct the structured codebook. Actions can then be represented by a set of sparse coefficients with respect to the structured codebook. Finally, a simple linear SVM classifier predicts the action class from this representation. Our method has two main advantages over traditional methods. First, it automatically discovers mid-level common patterns of movement that capture rich spatial and temporal contextual information. Second, it is robust to unwanted background local features, mainly because most such features cannot be sparsely represented by the common patterns; they are treated as residual errors and are not encoded into the action representation. We evaluate the proposed method on two popular benchmarks: the KTH action dataset and the UCF sports dataset. Experimental results demonstrate the advantages of our structured codebook construction. (C) 2014 Elsevier B.V. All rights reserved.
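
The encoding and background-rejection steps described above can be illustrated with a small sketch. The abstract does not specify the sparse solver, the pooling rule, or how the residual is thresholded, so everything here is an assumption made for illustration: greedy matching pursuit as the sparse coder, max pooling over descriptors, a hypothetical cutoff `max_rel_residual`, and a toy two-atom codebook standing in for the learned common patterns. The function names `matching_pursuit` and `encode_video` are likewise hypothetical, not the authors' implementation.

```python
import numpy as np


def matching_pursuit(x, codebook, n_nonzero=2):
    """Greedy sparse coding of x over unit-norm atoms (columns of codebook).

    Assumed solver for illustration; the paper's solver is not stated here.
    Returns the coefficient vector and the part of x left unexplained.
    """
    residual = x.astype(float).copy()
    coeffs = np.zeros(codebook.shape[1])
    for _ in range(n_nonzero):
        scores = codebook.T @ residual          # correlation with each atom
        k = int(np.argmax(np.abs(scores)))      # best-matching atom
        coeffs[k] += scores[k]
        residual -= scores[k] * codebook[:, k]  # remove what that atom explains
    return coeffs, residual


def encode_video(descriptors, codebook, n_nonzero=2, max_rel_residual=0.5):
    """Max-pool absolute sparse coefficients over well-reconstructed descriptors.

    Descriptors whose relative residual exceeds max_rel_residual (a
    hypothetical threshold) are treated as background and skipped.
    """
    pooled = np.zeros(codebook.shape[1])
    kept = 0
    for x in descriptors:
        coeffs, residual = matching_pursuit(x, codebook, n_nonzero)
        if np.linalg.norm(residual) <= max_rel_residual * np.linalg.norm(x):
            pooled = np.maximum(pooled, np.abs(coeffs))
            kept += 1
    return pooled, kept


# Toy codebook: two unit-norm "common patterns" in a 4-D descriptor space.
codebook = np.array([[1.0, 0.0],
                     [0.0, 1.0],
                     [0.0, 0.0],
                     [0.0, 0.0]])

# Two descriptors that lie on the patterns, one that does not (background-like).
descriptors = np.array([[2.0, 0.0, 0.0, 0.0],
                        [0.0, 3.0, 0.0, 0.0],
                        [0.0, 0.0, 5.0, 5.0]])

pooled, kept = encode_video(descriptors, codebook)
print(pooled, kept)
```

In this toy run the first two descriptors are reconstructed exactly and contribute to the pooled vector, while the third is orthogonal to both atoms, so it is left entirely in the residual and excluded. The pooled vector would then be the per-video input to a linear SVM.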
