A Hierarchical Deep Temporal Model for Group Activity Recognition

In group activity recognition, the temporal dynamics of the whole activity can be inferred based on the dynamics of the individual people representing the activity. We build a deep model to capture these dynamics based on LSTM (long short-term memory) models. To make use of these observations, we present a 2-stage deep temporal model for the group activity recognition problem. In our model, a LSTM model is designed to represent action dynamics of individual people in a sequence and another LSTM model is designed to aggregate person-level information for whole activity understanding. We evaluate our model over two datasets: the Collective Activity Dataset and a new volleyball dataset. Experimental results demonstrate that our proposed model improves group activity recognition performance compared to baseline methods.

[1]  Jonathan Tompson,et al.  Joint Training of a Convolutional Network and a Graphical Model for Human Pose Estimation , 2014, NIPS.

[2]  Bohyung Han,et al.  Multi-agent Event Detection: Localization and Role Assignment , 2013, 2013 IEEE Conference on Computer Vision and Pattern Recognition.

[3]  Dumitru Erhan,et al.  Going deeper with convolutions , 2014, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[4]  Fei-Fei Li,et al.  Social Role Discovery in Human Events , 2013, 2013 IEEE Conference on Computer Vision and Pattern Recognition.

[5]  Thomas B. Moeslund,et al.  Sports Type Classification Using Signature Heatmaps , 2013, 2013 IEEE Conference on Computer Vision and Pattern Recognition Workshops.

[6]  Aaron F. Bobick,et al.  Recognizing Planned, Multiperson Action , 2001, Comput. Vis. Image Underst..

[7]  Cordelia Schmid,et al.  Beyond Bags of Features: Spatial Pyramid Matching for Recognizing Natural Scene Categories , 2006, 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'06).

[8]  DarrellTrevor,et al.  Long-Term Recurrent Convolutional Networks for Visual Recognition and Description , 2017 .

[9]  Subhashini Venugopalan,et al.  Translating Videos to Natural Language Using Deep Recurrent Neural Networks , 2014, NAACL.

[10]  S. Intille,et al.  Recognizing planned, multi-person action , 2022 .

[11]  Deva Ramanan,et al.  Face detection, pose estimation, and landmark localization in the wild , 2012, 2012 IEEE Conference on Computer Vision and Pattern Recognition.

[12]  Barbara Caputo,et al.  Recognizing human actions: a local SVM approach , 2004, Proceedings of the 17th International Conference on Pattern Recognition, 2004. ICPR 2004..

[13]  Qiang Ji,et al.  Complex Activity Recognition Using Granger Constrained DBN (GCDBN) in Sports and Surveillance Video , 2014, 2014 IEEE Conference on Computer Vision and Pattern Recognition.

[14]  Cordelia Schmid,et al.  Action recognition by dense trajectories , 2011, CVPR 2011.

[15]  Mubarak Shah,et al.  UCF101: A Dataset of 101 Human Actions Classes From Videos in The Wild , 2012, ArXiv.

[16]  Trevor Darrell,et al.  Long-term recurrent convolutional networks for visual recognition and description , 2014, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[17]  Li Fei-Fei,et al.  Every Moment Counts: Dense Detailed Labeling of Actions in Complex Videos , 2015, International Journal of Computer Vision.

[18]  Hao Su,et al.  Object Bank: A High-Level Image Representation for Scene Classification & Semantic Feature Sparsification , 2010, NIPS.

[19]  Larry S. Davis,et al.  AVSS 2011 demo session: A large-scale benchmark dataset for event recognition in surveillance video , 2011, AVSS.

[20]  Navdeep Jaitly,et al.  Towards End-To-End Speech Recognition with Recurrent Neural Networks , 2014, ICML.

[21]  Cordelia Schmid,et al.  Action Recognition with Improved Trajectories , 2013, 2013 IEEE International Conference on Computer Vision.

[22]  Larry S. Davis,et al.  Multi-agent event recognition in structured scenarios , 2011, CVPR 2011.

[23]  Mohamed R. Amer,et al.  Cost-Sensitive Top-Down/Bottom-Up Inference for Multiscale Activity Recognition , 2012, ECCV.

[24]  Michael Felsberg,et al.  Accurate Scale Estimation for Robust Visual Tracking , 2014, BMVC.

[25]  Andrew Zisserman,et al.  Very Deep Convolutional Networks for Large-Scale Image Recognition , 2014, ICLR.

[26]  Yang Wang,et al.  Discriminative Latent Models for Recognizing Contextual Group Activities , 2012, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[27]  Hao Jiang,et al.  Scale and Rotation Invariant Approach to Tracking Human Body Part Regions in Videos , 2013, 2013 IEEE Conference on Computer Vision and Pattern Recognition Workshops.

[28]  Kuldip K. Paliwal,et al.  Bidirectional recurrent neural networks , 1997, IEEE Trans. Signal Process..

[29]  Silvio Savarese,et al.  A Unified Framework for Multi-target Tracking and Collective Activity Recognition , 2012, ECCV.

[30]  Rémi Ronfard,et al.  A survey of vision-based methods for action representation, segmentation and recognition , 2011, Comput. Vis. Image Underst..

[31]  Philipp Koehn,et al.  Moses: Open Source Toolkit for Statistical Machine Translation , 2007, ACL.

[32]  Raquel Urtasun,et al.  Fully Connected Deep Structured Networks , 2015, ArXiv.

[33]  Ming-Ching Chang,et al.  Probabilistic group-level motion analysis and scenario recognition , 2011, 2011 International Conference on Computer Vision.

[34]  Mubarak Shah,et al.  Tracking When the Camera Looks Away , 2015, 2015 IEEE International Conference on Computer Vision Workshop (ICCVW).

[35]  Fei-Fei Li,et al.  Large-Scale Video Classification with Convolutional Neural Networks , 2014, 2014 IEEE Conference on Computer Vision and Pattern Recognition.

[36]  Silvio Savarese,et al.  Learning context for collective activity recognition , 2011, CVPR 2011.

[37]  Indriyati Atmosukarto,et al.  Automatic Recognition of Offensive Team Formation in American Football Plays , 2013, 2013 IEEE Conference on Computer Vision and Pattern Recognition Workshops.

[38]  Mohamed R. Amer,et al.  HiRF: Hierarchical Random Field for Collective Activity Recognition in Videos , 2014, ECCV.

[39]  Wang Yan,et al.  Visual recognition by counting instances: A multi-instance cardinality potential kernel , 2015, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[40]  Matthew J. Hausknecht,et al.  Beyond short snippets: Deep networks for video classification , 2015, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[41]  Davis E. King,et al.  Dlib-ml: A Machine Learning Toolkit , 2009, J. Mach. Learn. Res..

[42]  Ronald Poppe,et al.  A survey on vision-based human action recognition , 2010, Image Vis. Comput..

[43]  Marcello Pelillo,et al.  A Game-Theoretic Probabilistic Approach for Detecting Conversational Groups , 2014, ACCV.

[44]  Li Fei-Fei,et al.  Detecting Events and Key Actors in Multi-person Videos , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[45]  Stefan Carlsson,et al.  Multi-Target Tracking - Linking Identities using Bayesian Network Inference , 2006, 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'06).

[46]  Vibhav Vineet,et al.  Conditional Random Fields as Recurrent Neural Networks , 2015, 2015 IEEE International Conference on Computer Vision (ICCV).

[47]  Sridha Sridharan,et al.  Recognising Team Activities from Noisy Data , 2013, 2013 IEEE Conference on Computer Vision and Pattern Recognition Workshops.

[48]  Jürgen Schmidhuber,et al.  Long Short-Term Memory , 1997, Neural Computation.

[49]  Alberto Del Bimbo,et al.  Understanding Sport Activities from Correspondences of Clustered Trajectories , 2015, 2015 IEEE International Conference on Computer Vision Workshop (ICCVW).

[50]  Song-Chun Zhu,et al.  Joint inference of groups, events and human roles in aerial videos , 2015, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[51]  Georges Quénot,et al.  TRECVID 2015 - An Overview of the Goals, Tasks, Data, Evaluation Mechanisms and Metrics , 2011, TRECVID.

[52]  Fei-Fei Li,et al.  Deep visual-semantic alignments for generating image descriptions , 2014, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[53]  Thomas Serre,et al.  HMDB: A large video database for human motion recognition , 2011, 2011 International Conference on Computer Vision.

[54]  Silvio Savarese,et al.  What are they doing? : Collective activity classification using spatio-temporal relationship among people , 2009, 2009 IEEE 12th International Conference on Computer Vision Workshops, ICCV Workshops.

[55]  Geoffrey E. Hinton,et al.  ImageNet classification with deep convolutional neural networks , 2012, Commun. ACM.

[56]  Nicu Sebe,et al.  The S-HOCK dataset: Analyzing crowds at the stadium , 2015, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[57]  Noel E. O'Connor,et al.  Team Activity Recognition in Sports , 2012, ECCV.

[58]  Greg Mori,et al.  Social roles in hierarchical models for human activity recognition , 2012, 2012 IEEE Conference on Computer Vision and Pattern Recognition.

[59]  Cordelia Schmid,et al.  A Spatio-Temporal Descriptor Based on 3D-Gradients , 2008, BMVC.

[60]  Sridha Sridharan,et al.  Predicting Ball Ownership in Basketball from a Monocular View Using Only Player Trajectories , 2015, 2015 IEEE International Conference on Computer Vision Workshop (ICCVW).

[61]  Lei Chen,et al.  Deep Structured Models For Group Activity Recognition , 2015, BMVC.

[62]  Andrew Zisserman,et al.  Two-Stream Convolutional Networks for Action Recognition in Videos , 2014, NIPS.

[63]  Larry S. Davis,et al.  Recognizing Plays in American Football Videos , 2009 .