Human action recognition based on multi-mode spatial-temporal feature fusion

Motion representation plays a vital role in human action recognition. In recent years, the application of deep learning to action recognition has become popular. However, extracting accurate motion features remains a great challenge. In this study, a novel feature representation that combines multi-scale spatial-temporal features is proposed. This descriptor contains spatial-temporal information from three modes, extracted from three input channels: RGB images, RGB difference images, and binary XOR images. Specifically, a network consisting of a convolutional neural network (CNN) and a long short-term memory (LSTM) network extracts spatial-temporal features from the RGB images and the RGB difference images, respectively. In addition, global motion information is extracted from the binary XOR images using a separate CNN. The features from the three channels are then combined into a new video feature representation. Finally, an extreme learning machine (ELM) is adopted as the classifier. Experimental results on the UCF50 dataset show the superiority of the proposed method.
