Multi-Level ResNets with Stacked SRUs for Action Recognition

Inspired by the consistent breakthroughs of convolutional networks in image classification, and noting that most existing video models are either inefficient or hard to optimize, we propose a multi-level residual network with stacked Simple Recurrent Units (R-SRU), trained end-to-end: ResNets learn spatial information from frame appearances, while stacked SRUs learn temporal dynamics from video sequences, making the model deep in both space and time. We investigate diverse hyper-parameter settings and recommend suitable choices for applying SRUs. In addition, we compare low-, mid-, and high-level features produced by ResNets, and combine multi-level features before passing them through SRUs with various temporal pooling schemes, experimentally quantifying how much each feature level contributes to action recognition. To the best of our knowledge, we are the first to apply SRUs to action recognition. Experiments on two standard benchmarks, HMDB-51 and UCF-101, show that R-SRU outperforms most methods that take only RGB frames as input and achieves performance competitive with the state of the art: 51.31% on HMDB-51 and 81.38% on UCF-101.
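To make the described pipeline concrete, below is a minimal sketch, assuming PyTorch and a ResNet-34 backbone; it is not the authors' implementation, and the class names (SRUCell, RSRU), layer sizes, number of SRU layers, and mean-over-time pooling are illustrative assumptions. The SRU recurrence follows the formulation of Lei et al. (2017), with a linear projection on the highway path so that input and hidden sizes may differ.

```python
# Minimal sketch: per-frame features from a ResNet are fed to stacked SRUs,
# mean-pooled over time, and classified. Sizes and names are illustrative.
import torch
import torch.nn as nn
import torchvision.models as models


class SRUCell(nn.Module):
    """Simple Recurrent Unit (Lei et al., 2017), simplified:
    x_tilde = W x_t,  f_t = sigmoid(W_f x_t),  r_t = sigmoid(W_r x_t),
    c_t = f_t * c_{t-1} + (1 - f_t) * x_tilde,
    h_t = r_t * tanh(c_t) + (1 - r_t) * skip(x_t)."""

    def __init__(self, input_size, hidden_size):
        super().__init__()
        self.proj = nn.Linear(input_size, 3 * hidden_size)  # x_tilde, f, r in one matmul
        self.skip = nn.Linear(input_size, hidden_size)       # highway path projection
        self.hidden_size = hidden_size

    def forward(self, x):                                    # x: (T, B, input_size)
        T, B, _ = x.shape
        x_tilde, f, r = self.proj(x).chunk(3, dim=-1)
        f, r = torch.sigmoid(f), torch.sigmoid(r)
        skip = self.skip(x)
        c = x.new_zeros(B, self.hidden_size)
        outputs = []
        for t in range(T):                                   # recurrence is elementwise only
            c = f[t] * c + (1.0 - f[t]) * x_tilde[t]
            outputs.append(r[t] * torch.tanh(c) + (1.0 - r[t]) * skip[t])
        return torch.stack(outputs)                          # (T, B, hidden_size)


class RSRU(nn.Module):
    """Spatial ResNet features per frame, stacked SRUs over time, mean pooling."""

    def __init__(self, num_classes=101, hidden_size=512, num_layers=2):
        super().__init__()
        resnet = models.resnet34(weights=None)               # spatial stream (512-d features)
        self.cnn = nn.Sequential(*list(resnet.children())[:-1])  # drop the final fc layer
        self.sru = nn.ModuleList(
            [SRUCell(512 if i == 0 else hidden_size, hidden_size) for i in range(num_layers)]
        )
        self.fc = nn.Linear(hidden_size, num_classes)

    def forward(self, clip):                                 # clip: (B, T, 3, 224, 224)
        B, T = clip.shape[:2]
        feats = self.cnn(clip.flatten(0, 1)).flatten(1)      # (B*T, 512)
        feats = feats.view(B, T, -1).transpose(0, 1)         # (T, B, 512)
        for layer in self.sru:
            feats = layer(feats)
        return self.fc(feats.mean(dim=0))                    # temporal mean pooling, then classify


if __name__ == "__main__":
    model = RSRU(num_classes=101)
    logits = model(torch.randn(2, 8, 3, 224, 224))           # 2 clips of 8 RGB frames
    print(logits.shape)                                      # torch.Size([2, 101])
```

Because the SRU gates depend only on the current input, the matrix multiplications for all time steps can be batched up front (as in `self.proj(x)` above), leaving only cheap elementwise operations inside the recurrence; this is the property that makes stacked SRUs faster to train than LSTMs of comparable depth.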
