Going deeper with two-stream ConvNets for action recognition in video surveillance

Abstract Learning by deep convolutional networks have shown an outstanding effectiveness in a variety of vision based classification tasks, and for which, large datasets are the prerequisites to guarantee its high performance. But in many realistic circumstances, using a massive quantity of training samples to achieve more sophisticated analysis is hard to be fulfilled always, such as human action recognition in videos, and the resulting problem of data deficiency, especially for the labeled data, would critically limit the deeper model structure as a promising solution due to its high risk of overfitting. Additionally, in lacking of high modeling capacity constrained by of model depth, the high-level visual cues like object interaction, scene context and pose variations concurrent with human action also could become the extrinsic and intrinsic challenges for the traditional deep convolutional networks. For the limitations above, in this paper, we proposed a strategy of dataset remodeling by transferring parameters of ResNet-101 layers trained on the ImageNet dataset to initialize learning model and adopt an augmented data variation approach to overcome the overfitting challenge of sample deficiency. For model structure improvement, a novel deeper two-stream ConvNets has been designed for the learning of action complexity. With a dis-order strategy of training/testing video sets, the proposed model and learning strategy are able to collaboratively achieve a significant improvement of action recognition. Experiments on two challenging datasets UCF101 and KTH have verified a superior performance in comparison with other state-of-the-art methods.

[1]  Dacheng Tao,et al.  Pose-invariant face recognition with homography-based normalization , 2017, Pattern Recognit..

[2]  Li Fei-Fei,et al.  ImageNet: A large-scale hierarchical image database , 2009, CVPR.

[3]  Quoc V. Le,et al.  Learning hierarchical invariant spatio-temporal features for action recognition with independent subspace analysis , 2011, CVPR 2011.

[4]  Jian Sun,et al.  Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification , 2015, 2015 IEEE International Conference on Computer Vision (ICCV).

[5]  Dumitru Erhan,et al.  Going deeper with convolutions , 2014, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[6]  Yann LeCun,et al.  Convolutional Learning of Spatio-temporal Features , 2010, ECCV.

[7]  Ke Lu,et al.  $p$-Laplacian Regularized Sparse Coding for Human Activity Recognition , 2016, IEEE Transactions on Industrial Electronics.

[8]  Fei Gao,et al.  Deep Multimodal Distance Metric Learning Using Click Constraints for Image Ranking , 2017, IEEE Transactions on Cybernetics.

[9]  Dacheng Tao,et al.  Robust Face Recognition via Multimodal Deep Face Representation , 2015, IEEE Transactions on Multimedia.

[10]  Jian Sun,et al.  Deep Residual Learning for Image Recognition , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[11]  Gang Sun,et al.  A Key Volume Mining Deep Framework for Action Recognition , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[12]  Sergey Ioffe,et al.  Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift , 2015, ICML.

[13]  Cordelia Schmid,et al.  Long-Term Temporal Convolutions for Action Recognition , 2016, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[14]  Jianping Fan,et al.  iPrivacy: Image Privacy Protection by Identifying Sensitive Objects via Deep Multi-Task Learning , 2017, IEEE Transactions on Information Forensics and Security.

[15]  Kaiming He,et al.  Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks , 2015, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[16]  Ming Yang,et al.  3D Convolutional Neural Networks for Human Action Recognition , 2010, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[17]  Bolei Zhou,et al.  Learning Deep Features for Scene Recognition using Places Database , 2014, NIPS.

[18]  Thomas Serre,et al.  HMDB: A large video database for human motion recognition , 2011, 2011 International Conference on Computer Vision.

[19]  Luc Van Gool,et al.  Two-Stream SR-CNNs for Action Recognition in Videos , 2016, BMVC.

[20]  Rob Fergus,et al.  Visualizing and Understanding Convolutional Networks , 2013, ECCV.

[21]  Jun Yu,et al.  Click Prediction for Web Image Reranking Using Multimodal Sparse Coding , 2014, IEEE Transactions on Image Processing.

[22]  Yoshua Bengio,et al.  Understanding the difficulty of training deep feedforward neural networks , 2010, AISTATS.

[23]  Jonghyun Choi,et al.  Multi-Directional Multi-Level Dual-Cross Patterns for Robust Face Recognition , 2014, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[24]  Horst Bischof,et al.  A Duality Based Approach for Realtime TV-L1 Optical Flow , 2007, DAGM-Symposium.

[25]  Matthew J. Hausknecht,et al.  Beyond short snippets: Deep networks for video classification , 2015, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[26]  Luc Van Gool,et al.  Temporal Segment Networks: Towards Good Practices for Deep Action Recognition , 2016, ECCV.

[27]  J.K. Aggarwal,et al.  Human activity analysis , 2011, ACM Comput. Surv..

[28]  Andrew Zisserman,et al.  Two-Stream Convolutional Networks for Action Recognition in Videos , 2014, NIPS.

[29]  Lorenzo Torresani,et al.  Learning Spatiotemporal Features with 3D Convolutional Networks , 2014, 2015 IEEE International Conference on Computer Vision (ICCV).

[30]  Mubarak Shah,et al.  UCF101: A Dataset of 101 Human Actions Classes From Videos in The Wild , 2012, ArXiv.

[31]  Yoshua Bengio,et al.  Learning long-term dependencies with gradient descent is difficult , 1994, IEEE Trans. Neural Networks.

[32]  Jun Yu,et al.  Exploiting Click Constraints and Multi-view Features for Image Re-ranking , 2014, IEEE Transactions on Multimedia.

[33]  Cordelia Schmid,et al.  Action Recognition with Improved Trajectories , 2013, 2013 IEEE International Conference on Computer Vision.

[34]  Ke Lu,et al.  Multiview Hessian regularized logistic regression for action recognition , 2015, Signal Process..

[35]  Barbara Caputo,et al.  Recognizing human actions: a local SVM approach , 2004, ICPR 2004.

[36]  Y. Wang,et al.  Large-scale paralleled sparse principal component analysis , 2014, Multimedia Tools and Applications.

[37]  Limin Wang,et al.  Action recognition with trajectory-pooled deep-convolutional descriptors , 2015, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[38]  Fei-Fei Li,et al.  Large-Scale Video Classification with Convolutional Neural Networks , 2014, 2014 IEEE Conference on Computer Vision and Pattern Recognition.

[39]  Andrew Zisserman,et al.  Very Deep Convolutional Networks for Large-Scale Image Recognition , 2014, ICLR.

[40]  Jitendra Malik,et al.  Contextual Action Recognition with R*CNN , 2015, 2015 IEEE International Conference on Computer Vision (ICCV).

[41]  Trevor Darrell,et al.  Rich Feature Hierarchies for Accurate Object Detection and Semantic Segmentation , 2013, 2014 IEEE Conference on Computer Vision and Pattern Recognition.

[42]  Zhe Wang,et al.  Towards Good Practices for Very Deep Two-Stream ConvNets , 2015, ArXiv.

[43]  Dacheng Tao,et al.  Multi-Task Pose-Invariant Face Recognition , 2015, IEEE Transactions on Image Processing.

[44]  Geoffrey E. Hinton,et al.  ImageNet classification with deep convolutional neural networks , 2012, Commun. ACM.

[45]  Yoshua Bengio,et al.  Gradient-based learning applied to document recognition , 1998, Proc. IEEE.