Hierarchical and Spatio-Temporal Sparse Representation for Human Action Recognition

In this paper, we present a novel two-layer video representation for human action recognition employing hierarchical group sparse encoding technique and spatio-temporal structure. In the first layer, a new sparse encoding method named locally consistent group sparse coding (LCGSC) is proposed to make full use of motion and appearance information of local features. LCGSC method not only encodes global layouts of features within the same video-level groups, but also captures local correlations between them, which obtains expressive sparse representations of video sequences. Meanwhile, two kinds of efficient location estimation models, namely an absolute location model and a relative location model, are developed to incorporate spatio-temporal structure into LCGSC representations. In the second layer, action-level group is established, where a hierarchical LCGSC encoding scheme is applied to describe videos at different levels of abstractions. On the one hand, the new layer captures higher order dependency between video sequences; on the other hand, it takes label information into consideration to improve discrimination of videos’ representations. The superiorities of our hierarchical framework are demonstrated on several challenging datasets.

[1]  Peng Dai,et al.  Group Interaction Analysis in Dynamic Context$^{\ast}$ , 2008, IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics).

[2]  Fei-Fei Li,et al.  Large-Scale Video Classification with Convolutional Neural Networks , 2014, 2014 IEEE Conference on Computer Vision and Pattern Recognition.

[3]  Yu Qiao,et al.  Action Recognition with Stacked Fisher Vectors , 2014, ECCV.

[4]  Xiaodong Yang,et al.  Super Normal Vector for Activity Recognition Using Depth Sequences , 2014, 2014 IEEE Conference on Computer Vision and Pattern Recognition.

[5]  Qi Tian,et al.  Action Recognition Using Spatial-Temporal Context , 2010, 2010 20th International Conference on Pattern Recognition.

[6]  Yun Fu,et al.  Max-Margin Action Prediction Machine , 2016, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[7]  Jiebo Luo,et al.  Recognizing realistic actions from videos “in the wild” , 2009, 2009 IEEE Conference on Computer Vision and Pattern Recognition.

[8]  Tieniu Tan,et al.  Group encoding of local features in image classification , 2012, Proceedings of the 21st International Conference on Pattern Recognition (ICPR2012).

[9]  Xiaodong Yang,et al.  Action Recognition Using Super Sparse Coding Vector with Spatio-temporal Awareness , 2014, ECCV.

[10]  Liang-Tien Chia,et al.  Local features are not lonely – Laplacian sparse coding for image classification , 2010, 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition.

[11]  Nitish Srivastava,et al.  Unsupervised Learning of Video Representations using LSTMs , 2015, ICML.

[12]  Changyin Sun,et al.  Action Recognition Using Nonnegative Action Component Representation and Sparse Basis Selection , 2014, IEEE Transactions on Image Processing.

[13]  Handong Zhao,et al.  Learning Fast Low-Rank Projection for Image Classification , 2016, IEEE Transactions on Image Processing.

[14]  Ming Shao,et al.  A Multi-stream Bi-directional Recurrent Neural Network for Fine-Grained Action Detection , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[15]  Cordelia Schmid,et al.  Dense Trajectories and Motion Boundary Descriptors for Action Recognition , 2013, International Journal of Computer Vision.

[16]  Andrew Zisserman,et al.  Two-Stream Convolutional Networks for Action Recognition in Videos , 2014, NIPS.

[17]  Thomas Serre,et al.  HMDB: A large video database for human motion recognition , 2011, 2011 International Conference on Computer Vision.

[18]  Samy Bengio,et al.  Group Sparse Coding , 2009, NIPS.

[19]  Xiaoqin Zhang,et al.  Adaptive learning codebook for action recognition , 2011, Pattern Recognit. Lett..

[20]  Haibin Ling,et al.  Modeling Geometric-Temporal Context With Directional Pyramid Co-Occurrence for Action Recognition , 2014, IEEE Transactions on Image Processing.

[21]  Juan Carlos Niebles,et al.  Unsupervised Learning of Human Action Categories , 2006 .

[22]  Tanaya Guha,et al.  Learning Sparse Representations for Human Action Recognition , 2012, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[23]  Hyung Jin Chang,et al.  Robust action recognition using local motion and group sparsity , 2014, Pattern Recognit..

[24]  Chunheng Wang,et al.  Attribute Regularization Based Human Action Recognition , 2013, IEEE Transactions on Information Forensics and Security.

[25]  Luc Van Gool,et al.  Temporal Segment Networks: Towards Good Practices for Deep Action Recognition , 2016, ECCV.

[26]  Yanjun Qi,et al.  Unsupervised Feature Learning by Deep Sparse Coding , 2013, SDM.

[27]  Yunde Jia,et al.  Interactive Phrases: Semantic Descriptionsfor Human Interaction Recognition , 2014, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[28]  Jian Yang,et al.  Sparseness Analysis in the Pretraining of Deep Neural Networks , 2017, IEEE Transactions on Neural Networks and Learning Systems.

[29]  Hao Wang,et al.  Hierarchical Dynamic Parsing and Encoding for Action Recognition , 2016, ECCV.

[30]  James W. Davis,et al.  The Recognition of Human Movement Using Temporal Templates , 2001, IEEE Trans. Pattern Anal. Mach. Intell..

[31]  Samy Bengio,et al.  Automatic analysis of multimodal group actions in meetings , 2005, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[32]  Juergen Gall,et al.  A BoW-equivalent Recurrent Neural Network for Action Recognition , 2015, BMVC.

[33]  Cordelia Schmid,et al.  Action Recognition with Improved Trajectories , 2013, 2013 IEEE International Conference on Computer Vision.

[34]  Hassan Foroosh,et al.  Exploring Sparseness and Self-Similarity for Action Recognition , 2015, IEEE Transactions on Image Processing.

[35]  Ling Shao,et al.  Combining appearance and structural features for human action recognition , 2013, Neurocomputing.

[36]  Tieniu Tan,et al.  Feature Coding in Image Classification: A Comprehensive Study , 2014, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[37]  Lorenzo Torresani,et al.  Learning Spatiotemporal Features with 3D Convolutional Networks , 2014, 2015 IEEE International Conference on Computer Vision (ICCV).

[38]  Amir Roshan Zamir,et al.  Action Recognition in Realistic Sports Videos , 2014 .

[39]  Mubarak Shah,et al.  UCF101: A Dataset of 101 Human Actions Classes From Videos in The Wild , 2012, ArXiv.

[40]  Hong Liu,et al.  A novel hierarchical Bag-of-Words model for compact action representation , 2016, Neurocomputing.

[41]  Nuno Vasconcelos,et al.  Recognizing Activities via Bag of Words for Attribute Dynamics , 2013, 2013 IEEE Conference on Computer Vision and Pattern Recognition.

[42]  Cordelia Schmid,et al.  A Robust and Efficient Video Representation for Action Recognition , 2015, International Journal of Computer Vision.

[43]  Limin Wang,et al.  Bag of visual words and fusion methods for action recognition: Comprehensive study and good practice , 2014, Comput. Vis. Image Underst..

[44]  Yang Yang,et al.  Learning semantic visual vocabularies using diffusion distance , 2009, CVPR.

[45]  Cristian Sminchisescu,et al.  Conditional models for contextual human motion recognition , 2006, Tenth IEEE International Conference on Computer Vision (ICCV'05) Volume 1.

[46]  Yun Fu,et al.  Deep Sequential Context Networks for Action Prediction , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[47]  Juan Carlos Niebles,et al.  Modeling Temporal Structure of Decomposable Motion Segments for Activity Classification , 2010, ECCV.

[48]  Mubarak Shah,et al.  Action MACH a spatio-temporal Maximum Average Correlation Height filter for action recognition , 2008, 2008 IEEE Conference on Computer Vision and Pattern Recognition.

[49]  Tong Zhang,et al.  Improved Local Coordinate Coding using Local Tangents , 2010, ICML.

[50]  Tieniu Tan,et al.  Salient coding for image classification , 2011, CVPR 2011.

[51]  James A. Reggia,et al.  Robust human action recognition via long short-term memory , 2013, The 2013 International Joint Conference on Neural Networks (IJCNN).

[52]  Chunheng Wang,et al.  Action Recognition Using Context-Constrained Linear Coding , 2012, IEEE Signal Processing Letters.

[53]  Andrew Zisserman,et al.  Video Google: a text retrieval approach to object matching in videos , 2003, Proceedings Ninth IEEE International Conference on Computer Vision.

[54]  Ming Shao,et al.  Sparse Canonical Temporal Alignment with Deep Tensor Decomposition for Action Recognition. , 2016, IEEE transactions on image processing : a publication of the IEEE Signal Processing Society.

[55]  Ivan Laptev,et al.  On Space-Time Interest Points , 2005, International Journal of Computer Vision.

[56]  John D. Lafferty,et al.  Learning image representations from the pixel level via hierarchical sparse coding , 2011, CVPR 2011.

[57]  Cordelia Schmid,et al.  Learning realistic human actions from movies , 2008, 2008 IEEE Conference on Computer Vision and Pattern Recognition.

[58]  Qiuqi Ruan,et al.  Action Recognition Using Local Consistent Group Sparse Coding with Spatio-Temporal Structure , 2016, ACM Multimedia.

[59]  Cor J. Veenman,et al.  Visual Word Ambiguity , 2010, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[60]  Yihong Gong,et al.  Locality-constrained Linear Coding for image classification , 2010, 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition.

[61]  Cordelia Schmid,et al.  Beyond Bags of Features: Spatial Pyramid Matching for Recognizing Natural Scene Categories , 2006, 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'06).

[62]  Thomas Mensink,et al.  Improving the Fisher Kernel for Large-Scale Image Classification , 2010, ECCV.