Spatio-temporal Aware Non-negative Component Representation for Action Recognition

This paper presents a novel mid-level representation for action recognition, named spatio-temporal aware non-negative component representation (STANNCR). The proposed STANNCR is based on action component and incorporates the spatial-temporal information. We first introduce a spatial-temporal distribution vector (STDV) to model the distributions of local feature locations in a compact and discriminative manner. Then we employ non-negative matrix factorization (NMF) to learn the action components and encode the video samples. The action component considers the correlations of visual words, which effectively bridge the sematic gap in action recognition. To incorporate the spatial-temporal cues for final representation, the STDV is used as the part of graph regularization for NMF. The fusion of spatial-temporal information makes the STANNCR more discriminative, and our fusion manner is more compact than traditional method of concatenating vectors. The proposed approach is extensively evaluated on three public datasets. The experimental results demonstrate the effectiveness of STANNCR for action recognition.

[1]  Mubarak Shah,et al.  A probabilistic representation for efficient large scale visual recognition tasks , 2011, CVPR 2011.

[2]  Jason J. Corso,et al.  Action bank: A high-level representation of activity in video , 2012, 2012 IEEE Conference on Computer Vision and Pattern Recognition.

[3]  Wei Liang,et al.  Discriminative human action recognition in the learned hierarchical manifold space , 2010, Image Vis. Comput..

[4]  Cordelia Schmid,et al.  Action and Event Recognition with Fisher Vectors on a Compact Feature Set , 2013, 2013 IEEE International Conference on Computer Vision.

[5]  Ming Yang,et al.  3D Convolutional Neural Networks for Human Action Recognition , 2010, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[6]  Frédéric Jurie,et al.  Modeling spatial layout with fisher vectors for image categorization , 2011, 2011 International Conference on Computer Vision.

[7]  Nuno Vasconcelos,et al.  Recognizing Activities via Bag of Words for Attribute Dynamics , 2013, 2013 IEEE Conference on Computer Vision and Pattern Recognition.

[8]  H. Sebastian Seung,et al.  Learning the parts of objects by non-negative matrix factorization , 1999, Nature.

[9]  Chong-Wah Ngo,et al.  Trajectory-Based Modeling of Human Actions with Motion Reference Points , 2012, ECCV.

[10]  Mubarak Shah,et al.  Classifying web videos using a global video descriptor , 2013, Machine Vision and Applications.

[11]  Yihong Gong,et al.  Linear spatial pyramid matching using sparse coding for image classification , 2009, CVPR.

[12]  Xiaojun Wu,et al.  Graph Regularized Nonnegative Matrix Factorization for Data Representation , 2017, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[13]  Andrew Zisserman,et al.  The devil is in the details: an evaluation of recent feature encoding methods , 2011, BMVC.

[14]  Silvio Savarese,et al.  Recognizing human actions by attributes , 2011, CVPR 2011.

[15]  Barbara Caputo,et al.  Recognizing human actions: a local SVM approach , 2004, Proceedings of the 17th International Conference on Pattern Recognition, 2004. ICPR 2004..

[16]  Jintao Li,et al.  Hierarchical spatio-temporal context modeling for action recognition , 2009, CVPR.

[17]  Cordelia Schmid,et al.  Learning realistic human actions from movies , 2008, 2008 IEEE Conference on Computer Vision and Pattern Recognition.

[18]  Qiuqi Ruan,et al.  Local non-negative component representation for human action recognition , 2014 .

[19]  Yi Yang,et al.  Space-Time Robust Representation for Action Recognition , 2013, 2013 IEEE International Conference on Computer Vision.

[20]  Wenbin Li,et al.  Graph regularized discriminative non-negative matrix factorization for face recognition , 2013, Multimedia Tools and Applications.

[21]  Lei Wang,et al.  In defense of soft-assignment coding , 2011, 2011 International Conference on Computer Vision.

[22]  Kristen Grauman,et al.  Relative attributes , 2011, 2011 International Conference on Computer Vision.

[23]  Quoc V. Le,et al.  Learning hierarchical invariant spatio-temporal features for action recognition with independent subspace analysis , 2011, CVPR 2011.

[24]  Cordelia Schmid,et al.  Beyond Bags of Features: Spatial Pyramid Matching for Recognizing Natural Scene Categories , 2006, 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'06).

[25]  Yihong Gong,et al.  Locality-constrained Linear Coding for image classification , 2010, 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition.

[26]  Gang Hua,et al.  Scene Aligned Pooling for Complex Video Recognition , 2012, ECCV.

[27]  Ronald Poppe,et al.  A survey on vision-based human action recognition , 2010, Image Vis. Comput..

[28]  Cordelia Schmid,et al.  A Spatio-Temporal Descriptor Based on 3D-Gradients , 2008, BMVC.

[29]  Yang Wang,et al.  Hidden Part Models for Human Action Recognition: Probabilistic versus Max Margin , 2011, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[30]  乔宇 Motionlets: Mid-Level 3D Parts for Human Motion Recognition , 2013 .

[31]  Cordelia Schmid,et al.  Dense Trajectories and Motion Boundary Descriptors for Action Recognition , 2013, International Journal of Computer Vision.

[32]  Thomas Serre,et al.  HMDB: A large video database for human motion recognition , 2011, 2011 International Conference on Computer Vision.

[33]  Feng Shi,et al.  Sampling Strategies for Real-Time Action Recognition , 2013, 2013 IEEE Conference on Computer Vision and Pattern Recognition.

[34]  William Brendel,et al.  Learning spatiotemporal graphs of human activities , 2011, 2011 International Conference on Computer Vision.

[35]  Ying Wu,et al.  Action recognition with multiscale spatio-temporal contexts , 2011, CVPR 2011.

[36]  Changyin Sun,et al.  Action Recognition Using Nonnegative Action Component Representation and Sparse Basis Selection , 2014, IEEE Transactions on Image Processing.

[37]  Iasonas Kokkinos,et al.  Discovering discriminative action parts from mid-level video representations , 2012, 2012 IEEE Conference on Computer Vision and Pattern Recognition.

[38]  Jiebo Luo,et al.  Recognizing realistic actions from videos “in the wild” , 2009, 2009 IEEE Conference on Computer Vision and Pattern Recognition.

[39]  Xiaodong Yang,et al.  Action Recognition Using Super Sparse Coding Vector with Spatio-temporal Awareness , 2014, ECCV.

[40]  Rama Chellappa,et al.  Machine Recognition of Human Activities: A Survey , 2008, IEEE Transactions on Circuits and Systems for Video Technology.

[41]  Shuicheng Yan,et al.  STAP: Spatial-Temporal Attention-Aware Pooling for Action Recognition , 2015, IEEE Transactions on Circuits and Systems for Video Technology.