SAST: Learning Semantic Action-Aware Spatial-Temporal Features for Efficient Action Recognition

State-of-the-art action recognition methods suffer from three challenges: (1) how to model the spatial transformations of an action, since its geometric structure varies over time in videos; (2) how to extract semantic action-aware temporal features from a video in which a large proportion of frames are irrelevant to the labeled action class and hurt the final performance; (3) the recognition speed of most existing models is too slow for real-world applications. In this paper, to address these three challenges, we propose a novel CNN-based action recognition method called SAST, comprising three modules that effectively learn semantic action-aware spatial-temporal features at a faster speed. First, to learn action-aware spatial features (spatial transformations), we design a weight-shared 2D Deformable Convolutional network named 2DDC, whose deformable convolutions adaptively adjust their receptive fields to the complex geometric structure of actions. Second, we propose a lightweight Temporal Attention module called TA to extract action-aware temporal features that are discriminative for the labeled action category. Finally, we apply an effective 3D network to learn the temporal context between frames and build the final video-level representation. To improve efficiency, our model takes only raw RGB frames as input, rather than both optical flow and RGB. Experimental results on four challenging video recognition datasets (Kinetics-400, Something-Something-V1, UCF101, and HMDB51) demonstrate that the proposed method not only achieves comparable performance but is also $10\times$ to $50\times$ faster than most state-of-the-art action recognition methods.
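The temporal-attention idea described above can be sketched as follows: score each frame's feature vector, softmax the scores over time, and pool the frames by those weights so that frames irrelevant to the labeled action contribute little. This is only a minimal NumPy illustration under stated assumptions — the scoring projection `w` here is a random stand-in for whatever learned scoring function the paper's TA module actually uses, and the real model operates on CNN feature maps, not raw vectors.

```python
import numpy as np

def temporal_attention(frame_features):
    """Attention-pool per-frame features into one video-level feature.

    frame_features: (T, D) array, one D-dim feature vector per frame.
    Returns (pooled, weights): the (D,) pooled feature and the (T,)
    softmax attention weights over time.
    """
    # Hypothetical learned scoring projection; random here for illustration only.
    rng = np.random.default_rng(0)
    w = rng.standard_normal(frame_features.shape[1])

    scores = frame_features @ w                 # (T,) one relevance score per frame
    weights = np.exp(scores - scores.max())     # numerically stable softmax...
    weights /= weights.sum()                    # ...so weights sum to 1 over time
    pooled = weights @ frame_features           # (D,) weighted average over frames
    return pooled, weights
```

In the full model this pooled representation would feed the subsequent 3D network rather than a classifier directly; the sketch only shows how attention down-weights irrelevant frames.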
