Multi-Group Multi-Attention: Towards Discriminative Spatiotemporal Representation

Learning spatiotemporal features is highly effective yet challenging for video understanding, especially action recognition. In this paper, we propose Multi-Group Multi-Attention, dubbed MGMA, which pays more attention to "where and when" an action happens in order to learn discriminative spatiotemporal representations in videos. The contribution of MGMA is three-fold. First, by devising a new spatiotemporal separable attention mechanism, it learns temporal attention and spatial attention separately for fine-grained spatiotemporal representation. Second, through a novel multi-group structure, it better captures multi-attention-rendered spatiotemporal features. Finally, the MGMA module is lightweight and flexible yet effective, so it can be easily embedded into any 3D Convolutional Neural Network (3D-CNN) architecture. We embed multiple MGMA modules into a 3D-CNN to train an end-to-end, RGB-only model and evaluate it on four popular benchmarks: UCF101, HMDB51, and Something-Something V1 and V2. An ablation study and experimental comparisons demonstrate the strength of MGMA, which achieves superior performance compared to state-of-the-art methods. Our code is available at https://github.com/zhenglab/mgma.
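The two core ideas — separable temporal/spatial attention and a multi-group channel split — can be sketched very loosely as follows. This is an illustrative toy, not the authors' implementation: `mgma_sketch` and its parameters are hypothetical names, and simple average pooling stands in for the learned attention layers of the actual module.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def mgma_sketch(x, num_groups=2):
    """Toy multi-group separable spatiotemporal attention.

    x: feature map of shape (C, T, H, W). Channels are split into groups;
    each group gets its own temporal attention (a weight per frame) and
    spatial attention (a weight per location), computed separately and
    applied with a residual connection.
    """
    C, T, H, W = x.shape
    assert C % num_groups == 0
    cg = C // num_groups
    out = np.empty_like(x)
    for g in range(num_groups):
        xg = x[g * cg:(g + 1) * cg]                 # (cg, T, H, W)
        # Temporal attention: pool over channels and space -> (T,)
        t_att = sigmoid(xg.mean(axis=(0, 2, 3)))
        # Spatial attention: pool over channels and time -> (H, W)
        s_att = sigmoid(xg.mean(axis=(0, 1)))
        # Separable attention, broadcast over the group's channels,
        # with an identity (residual) term so features are re-weighted,
        # not replaced.
        att = t_att[None, :, None, None] * s_att[None, None, :, :]
        out[g * cg:(g + 1) * cg] = xg * (1.0 + att)
    return out

x = np.random.rand(8, 4, 6, 6).astype(np.float32)
y = mgma_sketch(x, num_groups=2)
print(y.shape)  # (8, 4, 6, 6): attention preserves the feature-map shape
```

Because the attention maps are computed from (and applied to) the input without changing its shape, a block like this can be dropped between any two stages of a 3D-CNN, which mirrors the plug-in property the abstract claims.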
