Multi-Group Multi-Attention: Towards Discriminative Spatiotemporal Representation

Learning spatiotemporal features is highly effective yet challenging for video understanding, especially action recognition. In this paper, we propose Multi-Group Multi-Attention, dubbed MGMA, which pays more attention to "where and when" an action happens in order to learn discriminative spatiotemporal representations in videos. The contribution of MGMA is three-fold. First, by devising a new spatiotemporal separable attention mechanism, it learns temporal attention and spatial attention separately for fine-grained spatiotemporal representation. Second, through a novel multi-group structure, it better captures multi-attention-rendered spatiotemporal features. Finally, the MGMA module is lightweight and flexible yet effective, so it can be easily embedded into any 3D Convolutional Neural Network (3D-CNN) architecture. We embed multiple MGMA modules into a 3D-CNN to train an end-to-end, RGB-only model and evaluate it on four popular benchmarks: UCF101, HMDB51, and Something-Something V1 and V2. An ablation study and experimental comparisons demonstrate the strength of MGMA, which achieves superior performance compared to state-of-the-art methods. Our code is available at https://github.com/zhenglab/mgma.
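The two core ideas — separable temporal/spatial attention and a multi-group channel split — can be sketched very loosely as follows. This is an illustrative toy, not the authors' implementation: `mgma_sketch` and its parameters are hypothetical names, and simple average pooling stands in for the learned attention layers of the actual module.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def mgma_sketch(x, num_groups=2):
    """Toy multi-group separable spatiotemporal attention.

    x: feature map of shape (C, T, H, W). Channels are split into groups;
    each group gets its own temporal attention (a weight per frame) and
    spatial attention (a weight per location), computed separately and
    applied with a residual connection.
    """
    C, T, H, W = x.shape
    assert C % num_groups == 0
    cg = C // num_groups
    out = np.empty_like(x)
    for g in range(num_groups):
        xg = x[g * cg:(g + 1) * cg]                 # (cg, T, H, W)
        # Temporal attention: pool over channels and space -> (T,)
        t_att = sigmoid(xg.mean(axis=(0, 2, 3)))
        # Spatial attention: pool over channels and time -> (H, W)
        s_att = sigmoid(xg.mean(axis=(0, 1)))
        # Separable attention, broadcast over the group's channels,
        # with an identity (residual) term so features are re-weighted,
        # not replaced.
        att = t_att[None, :, None, None] * s_att[None, None, :, :]
        out[g * cg:(g + 1) * cg] = xg * (1.0 + att)
    return out

x = np.random.rand(8, 4, 6, 6).astype(np.float32)
y = mgma_sketch(x, num_groups=2)
print(y.shape)  # (8, 4, 6, 6): attention preserves the feature-map shape
```

Because the attention maps are computed from (and applied to) the input without changing its shape, a block like this can be dropped between any two stages of a 3D-CNN, which mirrors the plug-in property the abstract claims.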
