stagNet: An Attentive Semantic RNN for Group Activity and Individual Action Recognition

Group activity recognition plays a fundamental role in a variety of applications, e.g., sports video analysis, abnormal behavior detection, and intelligent surveillance. In a complex dynamic scene, a crucial yet challenging issue is how to model the spatio-temporal contextual information and inter-person relationships. In this paper, we present a novel attentive semantic recurrent neural network (RNN), named stagNet, for understanding group activities and individual actions in videos, which combines a spatio-temporal attention mechanism with semantic graph modeling. Specifically, a structured semantic graph is explicitly modeled to express the spatial contextual content of the whole scene, and it is further integrated with the temporal factor through a structural-RNN. By virtue of the “factor sharing” and “message passing” mechanisms, stagNet is capable of extracting discriminative and informative spatio-temporal representations and capturing inter-person relationships. Moreover, we adopt a spatio-temporal attention model to focus on key persons/frames for improved recognition performance. In addition, body-region attention and a global-part feature pooling strategy are devised for individual action recognition. In experiments, four widely used public datasets are adopted for performance evaluation, and the extensive results demonstrate the superiority and effectiveness of our method.
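To make the ideas of "message passing" over a person graph and attention over key persons more concrete, the following is a minimal NumPy sketch. It is an illustration under stated assumptions, not the stagNet architecture: the distance-threshold graph construction, the single tanh update, the dot-product attention score, and all names (`build_graph`, `message_passing_step`, `attention_pool`, `radius`, `w_self`, `w_msg`, `v`) are hypothetical choices made for this example only, and the temporal (RNN) part is omitted.

```python
import numpy as np


def build_graph(positions, radius):
    """Connect two persons if their centres are closer than `radius`.

    Assumption for illustration: a simple distance threshold; the
    abstract does not specify how graph edges are defined.
    """
    diff = positions[:, None, :] - positions[None, :, :]
    dist = np.linalg.norm(diff, axis=-1)
    adj = (dist < radius).astype(float)
    np.fill_diagonal(adj, 0.0)  # no self-loops
    return adj


def message_passing_step(h, adj, w_self, w_msg):
    """One round of message passing: each node mixes its own state
    with the mean state of its neighbours."""
    deg = adj.sum(axis=1, keepdims=True)
    norm_adj = adj / np.maximum(deg, 1.0)  # isolated nodes receive zeros
    return np.tanh(h @ w_self + norm_adj @ h @ w_msg)


def softmax(z):
    z = z - z.max()  # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()


def attention_pool(h, v):
    """Score each person with a (hypothetical) vector `v`, then return
    the attention weights and the weighted sum of person states."""
    alpha = softmax(h @ v)
    return alpha, alpha @ h


# Toy scene: 4 persons with 2-D positions and 8-D appearance features.
rng = np.random.default_rng(0)
positions = np.array([[0.0, 0.0], [1.0, 0.5], [1.5, 0.0], [9.0, 9.0]])
features = rng.normal(size=(4, 8))

adj = build_graph(positions, radius=2.0)
w_self = rng.normal(scale=0.3, size=(8, 8))
w_msg = rng.normal(scale=0.3, size=(8, 8))
v = rng.normal(size=8)

states = message_passing_step(features, adj, w_self, w_msg)
alpha, scene_vector = attention_pool(states, v)
print("attention over persons:", np.round(alpha, 3))
print("scene descriptor shape:", scene_vector.shape)
```

In this sketch the attention weights form a distribution over persons, so the pooled scene descriptor is dominated by the persons with the highest scores; a group-activity classifier would then operate on that descriptor.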
