Multi-Label Multi-Class Action Recognition With Deep Spatio-Temporal Layers Based on Temporal Gaussian Mixtures

Current action recognition studies enjoy the benefits of two neural network branches, spatial and temporal. This work aims to extend the previous work by introducing a fusion of spatial and temporal branches to provide superior action recognition capability toward multi-label multi-class classification problems. In this paper, we propose three fusion models with different fusion strategies. We first build several efficient temporal Gaussian mixture (TGM) layers to form spatial and temporal branches to learn a set of features. In addition to these branches, we introduce a new deep spatio-temporal branch consisting of a series of TGM layers to learn the features that emerged from the existing branches. Each branch produces a temporal-aware feature that assists the model in understanding the underlying action in a video. To verify the performance of our proposed models, we performed extensive experiments using the well-known MultiTHUMOS benchmarking dataset. The results demonstrate the importance of our proposed deep fusion mechanism, contributing to the overall score while keeping the number of parameters small.

[1]  Andreas Kunz,et al.  Res3ATN - Deep 3D Residual Attention Network for Hand Gesture Recognition in Videos , 2019, 2019 International Conference on 3D Vision (3DV).

[2]  J. Arunnehru,et al.  Human Action Recognition using 3D Convolutional Neural Networks with 3D Motion Cuboids in Surveillance Videos , 2018 .

[3]  Hao Yang,et al.  Time-Asymmetric 3d Convolutional Neural Networks for Action Recognition , 2019, 2019 IEEE International Conference on Image Processing (ICIP).

[4]  Xinyu Zhang,et al.  Two-Stream Convolution Neural Network with Video-stream for Action Recognition , 2019, 2019 International Joint Conference on Neural Networks (IJCNN).

[5]  Guangcan Liu,et al.  Multi-scale Spatiotemporal Information Fusion Network for Video Action Recognition , 2018, 2018 IEEE Visual Communications and Image Processing (VCIP).

[6]  Xiao Liu,et al.  StNet: Local and Global Spatial-Temporal Modeling for Action Recognition , 2018, AAAI.

[7]  Mubarak Shah,et al.  UCF101: A Dataset of 101 Human Actions Classes From Videos in The Wild , 2012, ArXiv.

[8]  Ping Shi,et al.  Multi-stream I3D Network for Fine-grained Action Recognition , 2018, 2018 IEEE 4th Information Technology and Mechatronics Engineering Conference (ITOEC).

[9]  Mubarak Shah,et al.  A 3-dimensional sift descriptor and its application to action recognition , 2007, ACM Multimedia.

[10]  Horst Bischof,et al.  A Duality Based Approach for Realtime TV-L1 Optical Flow , 2007, DAGM-Symposium.

[11]  Jitendra Malik,et al.  SlowFast Networks for Video Recognition , 2018, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[12]  Masaki Aono,et al.  Action Recognition by Composite Deep Learning Architecture I3D-DenseLSTM , 2019, 2019 International Conference of Advanced Informatics: Concepts, Theory and Applications (ICAICTA).

[13]  Barbara Caputo,et al.  Recognizing human actions: a local SVM approach , 2004, Proceedings of the 17th International Conference on Pattern Recognition, 2004. ICPR 2004..

[14]  Adrian Hilton,et al.  A survey of advances in vision-based human motion capture and analysis , 2006, Comput. Vis. Image Underst..

[15]  Andrew Zisserman,et al.  Two-Stream Convolutional Networks for Action Recognition in Videos , 2014, NIPS.

[16]  Sung Wook Baik,et al.  Action Recognition in Video Sequences using Deep Bi-Directional LSTM With CNN Features , 2018, IEEE Access.

[17]  Alexandros Stergiou,et al.  Spatio-Temporal FAST 3D Convolutions for Human Action Recognition , 2019, 2019 18th IEEE International Conference On Machine Learning And Applications (ICMLA).

[18]  Marjan Gholamrezaii,et al.  Human Activity Recognition Using 2D Convolutional Neural Networks , 2019, 2019 27th Iranian Conference on Electrical Engineering (ICEE).

[19]  Li Fei-Fei,et al.  Every Moment Counts: Dense Detailed Labeling of Actions in Complex Videos , 2015, International Journal of Computer Vision.

[20]  Yadong Mu,et al.  Two-Stream Video Classification with Cross-Modality Attention , 2019, 2019 IEEE/CVF International Conference on Computer Vision Workshop (ICCVW).

[21]  James W. Davis,et al.  The Recognition of Human Movement Using Temporal Templates , 2001, IEEE Trans. Pattern Anal. Mach. Intell..

[22]  Ghassan Al-Regib,et al.  TS-LSTM and Temporal-Inception: Exploiting Spatiotemporal Dynamics for Activity Recognition , 2017, Signal Process. Image Commun..

[23]  Limin Wang,et al.  Temporal Action Detection with Structured Segment Networks , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[24]  Yutaka Satoh,et al.  Learning Spatio-Temporal Features with 3D Residual Networks for Action Recognition , 2017, 2017 IEEE International Conference on Computer Vision Workshops (ICCVW).

[25]  Janusz Konrad,et al.  Action Recognition Using Sparse Representation on Covariance Manifolds of Optical Flow , 2010, 2010 7th IEEE International Conference on Advanced Video and Signal Based Surveillance.

[26]  Qiang Wu,et al.  Coupled Patch Alignment for Matching Cross-View Gaits , 2019, IEEE Transactions on Image Processing.

[27]  Ming Yang,et al.  3D Convolutional Neural Networks for Human Action Recognition , 2010, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[28]  Luc Van Gool,et al.  A Hough transform-based voting framework for action recognition , 2010, 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition.

[29]  Tao Mei,et al.  Learning Spatio-Temporal Representation with Pseudo-3D Residual Networks , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[30]  Deva Ramanan,et al.  Predictive-Corrective Networks for Action Detection , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[31]  Andrew Zisserman,et al.  Quo Vadis, Action Recognition? A New Model and the Kinetics Dataset , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[32]  Andrew Zisserman,et al.  Convolutional Two-Stream Network Fusion for Video Action Recognition , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[33]  Wen Gao,et al.  Action Recognition in Broadcast Tennis Video , 2006, 18th International Conference on Pattern Recognition (ICPR'06).

[34]  Cordelia Schmid,et al.  A Spatio-Temporal Descriptor Based on 3D-Gradients , 2008, BMVC.

[35]  Peng Zhang,et al.  A general tensor representation framework for cross-view gait recognition , 2019, Pattern Recognit..

[36]  Sergey Ioffe,et al.  Rethinking the Inception Architecture for Computer Vision , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[37]  Michael S. Ryoo,et al.  Temporal Gaussian Mixture Layer for Videos , 2018, ICML.

[38]  Xi Wang,et al.  Multi-Stream Multi-Class Fusion of Deep Networks for Video Classification , 2016, ACM Multimedia.

[39]  Wei Wu,et al.  STM: SpatioTemporal and Motion Encoding for Action Recognition , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[40]  Qing Lei,et al.  A Comprehensive Survey of Vision-Based Human Action Recognition Methods , 2019, Sensors.