Discriminative context learning with gated recurrent unit for group activity recognition

Abstract In this study, we address the problem of similar local motions that cause confusion between different group activities. To reduce the influence of these motions, we propose a discriminative group context feature (DGCF) that considers prominent sub-events. Moreover, we adopt a gated recurrent unit (GRU) model that can learn temporal changes in a sequence. In real-world scenarios, people perform activities of different temporal lengths. The GRU model, with nonlinear hidden units in the network, can be trained on data of arbitrary length. However, when a deep neural network model is used, data scarcity causes overfitting. Data augmentation methods designed for images are ineffective for trajectory data. Thus, we also propose a method for trajectory augmentation. We evaluate the effectiveness of the proposed method on three datasets. In our experiments on each dataset, we show that the proposed method outperforms competing state-of-the-art methods for group activity recognition.
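
To illustrate how a GRU can consume sequences of different temporal lengths, the following is a minimal NumPy sketch of a standard GRU cell applied to two trajectories of unequal length. It is not the authors' model: the class name `GRUCell`, the hidden size, the random weight initialization, and the use of raw (x, y) positions as input are all assumptions made for illustration, and the DGCF feature and the proposed trajectory augmentation are not reproduced here. The update rule follows one common convention, h_t = (1 - z_t) * h_{t-1} + z_t * h~_t; some formulations swap the roles of z_t and 1 - z_t.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


class GRUCell:
    """Illustrative GRU cell (hypothetical helper, untrained random weights)."""

    def __init__(self, input_dim, hidden_dim, seed=0):
        rng = np.random.default_rng(seed)
        shape = (hidden_dim, input_dim + hidden_dim)
        # One weight matrix per gate, acting on the concatenation [x_t, h_{t-1}].
        self.W_z = rng.normal(0.0, 0.1, shape)  # update gate
        self.W_r = rng.normal(0.0, 0.1, shape)  # reset gate
        self.W_h = rng.normal(0.0, 0.1, shape)  # candidate state
        self.b_z = np.zeros(hidden_dim)
        self.b_r = np.zeros(hidden_dim)
        self.b_h = np.zeros(hidden_dim)
        self.hidden_dim = hidden_dim

    def step(self, x, h):
        xh = np.concatenate([x, h])
        z = sigmoid(self.W_z @ xh + self.b_z)
        r = sigmoid(self.W_r @ xh + self.b_r)
        # The reset gate scales the previous state before the candidate is computed.
        h_tilde = np.tanh(self.W_h @ np.concatenate([x, r * h]) + self.b_h)
        return (1.0 - z) * h + z * h_tilde

    def encode(self, sequence):
        """Run the cell over a (T, input_dim) array for any T >= 0."""
        h = np.zeros(self.hidden_dim)
        for x in sequence:
            h = self.step(x, h)
        return h


# Two synthetic 2-D trajectories with different numbers of time steps.
cell = GRUCell(input_dim=2, hidden_dim=8)
rng = np.random.default_rng(1)
short_traj = rng.normal(size=(15, 2))
long_traj = rng.normal(size=(60, 2))

h_short = cell.encode(short_traj)
h_long = cell.encode(long_traj)
```

Both calls return a hidden vector of the same size regardless of sequence length, which is what would let a classifier on top of the final state be trained on activities of differing duration.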
