Human Action Recognition by Learning Spatio-Temporal Features With Deep Neural Networks

Human action recognition is one of the fundamental challenges in robotic systems. In this paper, we propose a lightweight action recognition architecture based on deep neural networks that uses only RGB data. The proposed architecture consists of a convolutional neural network (CNN), long short-term memory (LSTM) units, and a temporal-wise attention model. First, the CNN extracts spatial features that distinguish objects from the background, capturing both local and semantic characteristics. Second, two kinds of LSTM networks are applied to the spatial feature maps of different CNN layers (a pooling layer and a fully-connected layer) to extract temporal motion features. Then, a temporal-wise attention model placed after the LSTM learns which parts of which frames are more important. Lastly, a joint optimization module explores the intrinsic relations between the two kinds of LSTM features. Experimental results demonstrate the efficiency of the proposed method.
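
The temporal-wise attention step can be illustrated with a minimal sketch. It is not the paper's implementation: it assumes a common soft-attention form in which each frame's LSTM output is scored by a learned vector, the scores are normalized with a softmax over time, and the frame features are averaged with those weights. The names `temporal_attention`, `w`, `features`, and the sizes `T` and `D` are hypothetical, and random numbers stand in for real LSTM outputs.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def temporal_attention(frame_features, w):
    """Pool per-frame features into one clip-level vector.

    frame_features: list of T vectors of length D, standing in for the
        LSTM output at each frame (assumption, not the paper's tensors).
    w: hypothetical learned scoring vector of length D.
    Returns (weights, pooled); weights sum to 1 over the T frames.
    """
    # One bounded relevance score per frame.
    scores = [
        math.tanh(sum(wi * hi for wi, hi in zip(w, h)))
        for h in frame_features
    ]
    weights = softmax(scores)
    dim = len(frame_features[0])
    # Attention-weighted average of the frame features.
    pooled = [
        sum(a * h[d] for a, h in zip(weights, frame_features))
        for d in range(dim)
    ]
    return weights, pooled


# Toy input: T = 8 frames, D = 4 features per frame (placeholder values).
rng = random.Random(0)
T, D = 8, 4
features = [[rng.uniform(-1, 1) for _ in range(D)] for _ in range(T)]
w = [rng.uniform(-1, 1) for _ in range(D)]

weights, clip_vector = temporal_attention(features, w)
print("frame weights:", [round(a, 3) for a in weights])
print("clip vector length:", len(clip_vector))
```

Frames with higher scores contribute more to the clip-level vector, which a classifier could then consume. In a trained model, `w` would be learned jointly with the CNN and LSTM rather than sampled at random.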
