Action recognition by single stream convolutional neural networks: An approach using combined motion and static information

We investigate the problem of automatic action recognition and classification in videos. In this paper, we present a convolutional neural network architecture that takes both motion and static information as inputs in a single stream. We show that the network is able to treat motion and static information as distinct feature maps and extract features from them, even though they are stacked together at the input. We trained and tested our network on the YouTube dataset, where it surpasses state-of-the-art hand-engineered feature methods. Furthermore, we studied and compared the effect of providing static information to the network for the task of action recognition. Our results justify the use of optical flow as the raw representation of motion and also show the importance of static information in the context of action recognition.
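The single-stream idea described above can be sketched as follows: the static RGB frame and the two-channel optical flow field are concatenated along the channel axis, so a single CNN receives one multi-channel input containing both appearance and motion. This is a minimal NumPy illustration, not the paper's implementation; the resolution, channel ordering, and the helper `make_single_stream_input` are assumptions for demonstration.

```python
import numpy as np

# Assumed input resolution (hypothetical; the paper's actual size may differ).
H, W = 224, 224

# Static information: one RGB frame (3 channels).
rgb = np.random.rand(H, W, 3).astype(np.float32)

# Motion information: dense optical flow (horizontal and vertical components).
flow = np.random.rand(H, W, 2).astype(np.float32)

def make_single_stream_input(rgb, flow):
    """Stack appearance and motion into one multi-channel input map,
    so a single-stream CNN can learn filters over both jointly."""
    assert rgb.shape[:2] == flow.shape[:2], "spatial sizes must match"
    return np.concatenate([rgb, flow], axis=-1)  # shape (H, W, 5)

x = make_single_stream_input(rgb, flow)
print(x.shape)  # (224, 224, 5)
```

The first convolutional layer of such a network simply uses five input channels instead of three, which is what lets one stream process motion and static feature maps together.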
