Ensembling 3D CNN Framework for Video Recognition

Video-based behavior recognition is a challenging research topic. Three-dimensional convolutional neural networks (3D CNNs) can capture features directly from video. A 3D CNN extends a two-dimensional convolutional neural network by adding a time dimension, which lets it express motion information more effectively than a 2D network. To make better use of the features extracted from the original video, only stacked RGB frames are used as the network input. This paper proposes an ensemble 3D CNN framework for video recognition. First, a 3D convolutional neural network based on multi-level feature fusion is constructed and initialized from a model pre-trained on Sports-1M; fusing multiple convolutional features yields the final high-dimensional feature combination. Then, a 3D convolutional neural network based on ensemble learning is proposed to increase motion information, enrich motion features, and make the representation more robust than a single feature. The Bagging algorithm produces three incomplete training sets, and each set is used to train a separate 3D convolutional neural network so that the three networks differ. The output features of the three networks are then integrated by feeding them into an SVM classifier through the Stacking algorithm, which gives the final result. The integration effects of different ensemble methods are compared. Experimental results show that the proposed method effectively improves recognition accuracy on the UCF-101 dataset.
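The Bagging and Stacking steps described in the abstract can be illustrated with a minimal, standard-library-only Python sketch. It is not the paper's implementation: the toy 2-D points stand in for video clips, a nearest-centroid scorer stands in for each 3D CNN, and a small hinge-loss linear SVM stands in for the SVM classifier. All function names, the data, and the hyperparameters are assumptions made for illustration only.

```python
import math
import random

def make_blobs(n_per_class, rng):
    """Toy 2-D data (assumption): class 0 near (0, 0), class 1 near (4, 4)."""
    X, y = [], []
    for label, centre in ((0, 0.0), (1, 4.0)):
        for _ in range(n_per_class):
            X.append([rng.gauss(centre, 1.0), rng.gauss(centre, 1.0)])
            y.append(label)
    return X, y

def bootstrap_indices(n, rng):
    """Bagging: draw n indices with replacement, so some samples are left out."""
    return [rng.randrange(n) for _ in range(n)]

def fit_centroids(X, y):
    """Stand-in base learner (NOT a 3D CNN): one mean vector per class."""
    centroids = {}
    for label in sorted(set(y)):
        rows = [x for x, lab in zip(X, y) if lab == label]
        centroids[label] = [sum(col) / len(rows) for col in zip(*rows)]
    return centroids

def class_scores(centroids, x):
    """Base-learner output: softmax over negative squared centroid distances."""
    logits = [-sum((a - b) ** 2 for a, b in zip(x, centroids[c]))
              for c in sorted(centroids)]
    top = max(logits)
    exps = [math.exp(v - top) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def stack_features(models, x):
    """Stacking input: concatenate the outputs of all base learners."""
    feats = []
    for model in models:
        feats.extend(class_scores(model, x))
    return feats

def train_linear_svm(F, t, epochs=100, lr=0.1, lam=0.001):
    """Binary linear SVM: subgradient descent on the regularised hinge loss."""
    w, b = [0.0] * len(F[0]), 0.0
    for _ in range(epochs):
        for f, ti in zip(F, t):
            margin = ti * (sum(wi * fi for wi, fi in zip(w, f)) + b)
            hit = margin < 1.0  # only margin violations contribute a data term
            w = [wi - lr * (lam * wi - (ti * fi if hit else 0.0))
                 for wi, fi in zip(w, f)]
            if hit:
                b += lr * ti
    return w, b

def svm_predict(w, b, f):
    return 1 if sum(wi * fi for wi, fi in zip(w, f)) + b >= 0 else 0

rng = random.Random(0)
X_train, y_train = make_blobs(30, rng)
X_test, y_test = make_blobs(20, rng)

# Three bootstrap samples -> three differently trained base learners.
samples, models = [], []
for _ in range(3):
    idx = bootstrap_indices(len(X_train), rng)
    samples.append(idx)
    models.append(fit_centroids([X_train[i] for i in idx],
                                [y_train[i] for i in idx]))

# Simplification: the meta-classifier is fit on the same training points;
# a careful stacking setup would use held-out predictions instead.
F_train = [stack_features(models, x) for x in X_train]
w, b = train_linear_svm(F_train, [2 * lab - 1 for lab in y_train])

preds = [svm_predict(w, b, stack_features(models, x)) for x in X_test]
accuracy = sum(p == t for p, t in zip(preds, y_test)) / len(y_test)
print(f"stacked accuracy: {accuracy:.2f}")
```

Each bootstrap sample omits some training points, which is what makes the three base learners differ; the meta-classifier then sees one concatenated vector (here 3 learners x 2 class scores) per input rather than a single learner's output.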
