Action Recognition Using Local Consistent Group Sparse Coding with Spatio-Temporal Structure

This paper presents a novel and efficient framework for human action recognition that integrates a local consistent group sparse representation with the spatio-temporal structure of each video sequence. We first propose a sparse encoding scheme, local consistent group sparse coding (LCGSC), to generate the sparse representation of each video sequence. The scheme takes into account both the global structural information of features belonging to the same group and the local correlations between similar features. To incorporate spatio-temporal structure, we propose an average location (AL) model that describes the distribution of each visual word along the spatio-temporal coordinates, computed from the obtained sparse codes. Finally, each video sequence is jointly represented by its sparse representation and its spatio-temporal layout, which together fully model its motion, appearance, and spatio-temporal information. Our framework is computationally efficient and achieves performance comparable to state-of-the-art methods on challenging datasets.
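
The abstract does not give the formula for the average location (AL) model, so the following is only an illustrative sketch of one plausible reading: for each visual word, take the mean of the feature locations, weighted by the magnitude of that word's sparse coefficient. The function name `average_locations`, the use of absolute coefficient values as weights, and the assumption that coordinates are normalized (x, y, t) triples are all assumptions made here, not details taken from the paper.

```python
def average_locations(codes, coords, eps=1e-12):
    """Weighted mean location of each visual word (illustrative sketch).

    codes:  list of N sparse codes, each a list of K coefficients.
    coords: list of N feature locations, each an (x, y, t) triple,
            assumed here to be normalized to [0, 1].
    Returns a list of K entries; entry k is the weighted mean location of
    word k, or None if no feature activates that word.
    """
    if len(codes) != len(coords):
        raise ValueError("codes and coords must have the same length")
    num_words = len(codes[0]) if codes else 0
    dims = len(coords[0]) if coords else 0

    weighted_sums = [[0.0] * dims for _ in range(num_words)]
    total_weights = [0.0] * num_words

    for code, location in zip(codes, coords):
        for k, coefficient in enumerate(code):
            # Assumption: the coefficient magnitude acts as the weight.
            weight = abs(coefficient)
            if weight == 0.0:
                continue
            total_weights[k] += weight
            for d in range(dims):
                weighted_sums[k][d] += weight * location[d]

    layout = []
    for k in range(num_words):
        if total_weights[k] <= eps:
            layout.append(None)  # word never used in this video
        else:
            layout.append([s / total_weights[k] for s in weighted_sums[k]])
    return layout


# Two features, three visual words (toy values chosen for illustration).
example_codes = [[1.0, 0.0, 0.0], [1.0, 2.0, 0.0]]
example_coords = [[0.0, 0.0, 0.0], [1.0, 1.0, 1.0]]
example_layout = average_locations(example_codes, example_coords)
```

Under this reading, the per-word locations could be concatenated with the pooled sparse codes to form the joint video representation that the abstract describes; how the paper actually combines the two is not stated here.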
