A Multi-View Human Action Recognition System in Limited Data Case using Multi-Stream CNN

In recent years, convolutional neural networks (CNNs) have been used extensively for human action recognition. However, training a CNN with limited data remains a challenging problem. This paper proposes a multi-stream 3D-CNN structure for multi-view human action recognition. To improve recognition performance when training data are limited, the model uses a four-stream 3D-CNN fed with handcrafted features, namely optical flow and gradients in the horizontal and vertical directions. The full model combines the four-stream 3D-CNNs from the different views. The proposed multi-stream 3D-CNN is evaluated on the IXMAS and NIXMAS multi-view datasets, and the experiments show superior results compared with state-of-the-art methods. On IXMAS, the model achieves a 3.58% improvement over a single-stream 3D-CNN architecture that uses raw video data. On NIXMAS, where the number of training samples is even more limited, the improvement over the single-stream 3D-CNN structure is 12.6%.
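As a rough illustration of the ideas in the abstract, the sketch below prepares the horizontal and vertical gradient streams for a grayscale clip and fuses class scores across views and streams. It is a minimal sketch under stated assumptions, not the authors' implementation: the function names `gradient_streams` and `fuse_scores` are hypothetical, the gradients are plain finite differences, and averaging the scores is only one possible way to combine the networks, since the abstract does not state the fusion rule. The optical-flow streams are omitted because they would need a separate flow estimator.

```python
import numpy as np


def gradient_streams(clip):
    """Return horizontal and vertical spatial gradients of a clip.

    clip: array of shape (T, H, W), a grayscale video clip.
    Returns (gx, gy), each of shape (T, H, W).
    Assumption: finite differences stand in for whatever gradient
    operator the paper actually uses.
    """
    clip = np.asarray(clip, dtype=np.float32)
    # np.gradient returns one array per requested axis, in order:
    # axis 1 is rows (vertical), axis 2 is columns (horizontal).
    gy, gx = np.gradient(clip, axis=(1, 2))
    return gx, gy


def fuse_scores(scores):
    """Fuse class scores from several views and streams by averaging.

    scores: array of shape (V, S, C) holding class probabilities for
    V views, S streams per view, and C action classes.
    Returns (predicted_class_index, fused_scores of shape (C,)).
    Assumption: simple averaging; the paper may fuse differently.
    """
    scores = np.asarray(scores, dtype=np.float64)
    fused = scores.mean(axis=(0, 1))
    return int(fused.argmax()), fused


if __name__ == "__main__":
    # Toy clip: 4 frames of 6x8 pixels with random intensities.
    rng = np.random.default_rng(0)
    clip = rng.random((4, 6, 8))
    gx, gy = gradient_streams(clip)
    print(gx.shape, gy.shape)

    # Toy scores: 5 views, 4 streams, 11 classes (sizes are illustrative).
    raw = rng.random((5, 4, 11))
    probs = raw / raw.sum(axis=-1, keepdims=True)
    label, fused = fuse_scores(probs)
    print(label, fused.shape)
```

In a complete pipeline, each stream would be fed to its own 3D-CNN, and the per-network outputs would take the place of the toy scores used here.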
