Improving Human Action Recognition with Two-Stream 3D Convolutional Neural Network

Human action recognition has become a hot topic in recent years because it enables a wide range of applications such as video surveillance, assisted living, and entertainment. Recently, techniques based on convolutional neural networks have produced impressive improvements over traditional techniques based on handcrafted features. Prior work has also shown that combining different streams of data helps to increase recognition performance. This paper proposes a method that exploits both RGB and optical flow for human action recognition. Specifically, we deploy a two-stream convolutional neural network that takes as inputs the RGB frames and the optical flow computed from the RGB stream. Each stream has the architecture of an existing 3D convolutional neural network (C3D), which has been shown to be compact yet efficient for the task of action recognition from video. The two streams operate independently and are then combined by early fusion or late fusion to output the recognition results. We show that the proposed two-stream 3D convolutional neural network (2stream C3D) outperforms the one-stream C3D on two benchmark datasets: UCF101 (from 82.37% to 88.79%) and HMDB51 (from 48.43% to 62.54%).
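As a rough illustration of the two fusion strategies mentioned above, the following minimal sketch assumes that late fusion means a weighted average of the per-class softmax scores of the two streams, and that early fusion means concatenating the two streams' feature vectors before a shared classifier. The abstract does not specify the exact fusion operators, so the function names, the `rgb_weight` parameter, and the averaging rule are all hypothetical.

```python
import math


def softmax(logits):
    """Convert raw class scores into probabilities (numerically stable)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def late_fusion(rgb_logits, flow_logits, rgb_weight=0.5):
    """Assumed late fusion: weighted average of each stream's class probabilities."""
    if len(rgb_logits) != len(flow_logits):
        raise ValueError("both streams must score the same set of classes")
    rgb_probs = softmax(rgb_logits)
    flow_probs = softmax(flow_logits)
    return [
        rgb_weight * r + (1.0 - rgb_weight) * f
        for r, f in zip(rgb_probs, flow_probs)
    ]


def early_fusion(rgb_features, flow_features):
    """Assumed early fusion: concatenate stream features for a shared classifier."""
    return list(rgb_features) + list(flow_features)


def predict(probs):
    """Return the index of the most probable class."""
    return max(range(len(probs)), key=probs.__getitem__)


# Toy example with three classes: the RGB stream weakly prefers class 0,
# while the optical-flow stream strongly prefers class 2.
fused = late_fusion([2.0, 1.0, 0.0], [0.0, 1.0, 5.0])
print(predict(fused))
```

In this toy case the confident flow stream outweighs the less certain RGB stream, which is the kind of complementary behavior that motivates combining the two inputs.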
