Mining Semantics-Preserving Attention for Group Activity Recognition

In this paper, we propose a Semantics-Preserving Teacher-Student (SPTS) model for group activity recognition in videos, which mines semantics-preserving attention to automatically identify the key people and discard the misleading ones. Conventional methods usually aggregate the features extracted from individual persons by pooling operations, which cannot fully exploit the contextual information for group activity recognition. To address this, our SPTS networks first learn a Teacher Network in the semantic domain, which classifies the word of the group activity based on the words of the individual actions. We then design a Student Network in the vision domain, which recognizes the group activity from the input videos, and force the Student Network to mimic the Teacher Network during learning. In this way, we allocate semantics-preserving attention to different people, which adequately explores the contextual information across people and requires no extra labelled data. Experimental results on two widely used benchmarks for group activity recognition show that our method outperforms state-of-the-art approaches.
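
The idea of replacing plain pooling with attention that a student learns from a teacher can be illustrated with a minimal sketch. Everything below is an assumption made for illustration only: the abstract does not specify the attention form or the mimicry loss, so the softmax weighting, the mean-squared-error loss, the helper names (`softmax`, `attention_pool`, `mimic_loss`), and the toy numbers are hypothetical and are not the SPTS implementation.

```python
import math

def softmax(scores):
    """Turn raw per-person scores into attention weights that sum to 1."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention_pool(person_features, scores):
    """Aggregate per-person features with attention weights instead of
    plain max/average pooling (assumed form, not taken from the paper)."""
    weights = softmax(scores)
    dim = len(person_features[0])
    pooled = [
        sum(w * f[d] for w, f in zip(weights, person_features))
        for d in range(dim)
    ]
    return pooled, weights

def mimic_loss(student_weights, teacher_weights):
    """Hypothetical mimicry term: mean squared error between the student's
    and the teacher's attention over the same set of people."""
    n = len(student_weights)
    return sum((s - t) ** 2 for s, t in zip(student_weights, teacher_weights)) / n

# Toy scene with three people and 2-D features (made-up values).
person_features = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
teacher_scores = [2.0, 0.0, 0.0]  # teacher treats person 0 as the key person
student_scores = [0.0, 0.0, 0.0]  # untrained student attends uniformly

_, teacher_w = attention_pool(person_features, teacher_scores)
pooled, student_w = attention_pool(person_features, student_scores)
loss = mimic_loss(student_w, teacher_w)
```

With uniform student scores the pooled feature reduces to the average over people, so the attention only departs from average pooling once the student's weights move toward the teacher's; minimizing a term like `mimic_loss` alongside the classification loss is one plausible way to encourage that.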
