Action recognition with temporal scale-invariant deep learning framework

Recognizing actions from video features is an important problem with a wide range of applications. In this paper, we propose a temporal scale-invariant deep learning framework for action recognition that is robust to changes in action speed. Specifically, a video is first split into several sub-action clips, and a keyframe is selected from each clip. The spatial and motion features of each keyframe are extracted separately by two Convolutional Neural Networks (CNNs) and combined in a convolutional fusion layer that learns the relationship between them. Long Short-Term Memory (LSTM) networks are then applied to the fused features to model long-term temporal cues. Finally, the action prediction scores of the LSTM network are combined by linear weighted summation. Extensive experiments are conducted on two popular and challenging benchmarks, UCF-101 and HMDB51. On both benchmarks, our framework outperforms state-of-the-art methods, reaching 93.7% on UCF-101 and 69.5% on HMDB51.
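
The first and last steps of this pipeline can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not say how sub-action clips are delimited, how keyframes are chosen, or how the fusion weights are set, so the sketch assumes equal-length clips, the centre frame of each clip as its keyframe, and weights supplied by the caller. The function names `split_into_clips`, `select_keyframes`, and `weighted_score_fusion` are hypothetical. The CNN feature extraction and the LSTM are omitted.

```python
def split_into_clips(num_frames, num_clips):
    """Return (start, end) frame ranges, end exclusive, for each clip.

    Assumption: clips are of (near-)equal length. The abstract only says
    the video is split into sub-action clips, not how.
    """
    if num_clips <= 0 or num_frames < num_clips:
        raise ValueError("need at least one frame per clip")
    bounds = [round(i * num_frames / num_clips) for i in range(num_clips + 1)]
    return [(bounds[i], bounds[i + 1]) for i in range(num_clips)]


def select_keyframes(num_frames, num_clips):
    """Pick one keyframe index per clip.

    Assumption: the centre frame stands in for the keyframe. Because the
    number of keyframes is fixed, a fast and a slow performance of the
    same action yield sequences of the same length.
    """
    return [(start + end - 1) // 2
            for start, end in split_into_clips(num_frames, num_clips)]


def weighted_score_fusion(step_scores, weights):
    """Combine per-step class scores by linear weighted summation.

    step_scores: one list of class scores per LSTM output step.
    weights:     one weight per step (hypothetical values, chosen by the caller).
    """
    if len(step_scores) != len(weights):
        raise ValueError("one weight is required per score vector")
    fused = [0.0] * len(step_scores[0])
    for scores, weight in zip(step_scores, weights):
        for class_index, score in enumerate(scores):
            fused[class_index] += weight * score
    return fused


# A 100-frame and a 200-frame video both reduce to five keyframes.
short_video = select_keyframes(100, 5)
long_video = select_keyframes(200, 5)

# Fuse two made-up score vectors over two classes with equal weights.
fused = weighted_score_fusion([[0.2, 0.8], [0.6, 0.4]], [0.5, 0.5])
predicted_class = max(range(len(fused)), key=fused.__getitem__)
```

Sampling a fixed number of keyframes, rather than frames at a fixed rate, is what makes the input length independent of action speed in this sketch.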
