Learning context for collective activity recognition

In this paper we present a framework for the recognition of collective human activities. A collective activity is defined or reinforced by the coherent behavior of individuals in time and space; we call such coherent behavior ‘Crowd Context’. Examples of collective activities are “queuing in a line” or “talking”. Following [7], we propose to recognize collective activities using the crowd context and introduce a new scheme for learning it automatically. Our scheme is built on a Random Forest structure that randomly samples variable-volume spatio-temporal regions to pick the most discriminative attributes for classification. Unlike previous approaches, our algorithm automatically finds, by randomization, the optimal configuration of spatio-temporal bins over which to sample the evidence, which yields a general methodology for modeling crowd context. We employ a 3D Markov Random Field to regularize the classification and to localize collective activities in the scene. We demonstrate the flexibility and scalability of the proposed framework in a number of experiments and show that our method outperforms state-of-the-art action classification techniques [7, 19].
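
To make the bin-sampling idea concrete, the following minimal Python sketch (not the authors' implementation; the function names, parameter ranges, and the simple count-based response are illustrative assumptions) shows how a random-forest split candidate could be generated by drawing a random variable-volume spatio-temporal region around an anchor person and scoring how much evidence falls inside it. In a trained tree, the bin/threshold pair with the best information gain would be kept as the node's split function.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_st_bin(max_radius=5.0, max_dt=10):
    """Draw a random variable-volume spatio-temporal region:
    a radial interval, an angular sector, and a temporal window.
    (Parameter ranges are illustrative; angular wrap-around is ignored.)"""
    r_lo = rng.uniform(0.0, max_radius)
    r_hi = rng.uniform(r_lo, max_radius)
    theta_lo = rng.uniform(-np.pi, np.pi)
    theta_width = rng.uniform(0.0, np.pi)
    dt = int(rng.integers(1, max_dt + 1))
    return {"r": (r_lo, r_hi), "theta": (theta_lo, theta_lo + theta_width), "dt": dt}

def bin_response(anchor, people, st_bin):
    """Count how many detections fall inside the sampled bin.
    `people` holds (x, y, t) positions; the bin is defined relative to `anchor`."""
    rel = people - anchor                     # positions relative to the anchor person
    r = np.hypot(rel[:, 0], rel[:, 1])        # planar distance
    theta = np.arctan2(rel[:, 1], rel[:, 0])  # bearing
    in_r = (st_bin["r"][0] <= r) & (r < st_bin["r"][1])
    in_theta = (st_bin["theta"][0] <= theta) & (theta < st_bin["theta"][1])
    in_t = np.abs(rel[:, 2]) <= st_bin["dt"]
    return int(np.sum(in_r & in_theta & in_t))

# Toy usage: score a few randomly sampled bins for one anchor person.
anchor = np.array([0.0, 0.0, 0.0])
people = rng.uniform(-5.0, 5.0, size=(20, 3))
for _ in range(3):
    st_bin = sample_st_bin()
    print(st_bin, bin_response(anchor, people, st_bin))
```

The point of the sketch is that the bin geometry itself is a random variable: instead of fixing a spatio-temporal grid in advance, each tree node samples candidate regions and keeps whichever proves most discriminative on the training data.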

[1] Andrew Zisserman et al. Progressive search space reduction for human pose estimation. IEEE Conference on Computer Vision and Pattern Recognition, 2008.

[2] Ronen Basri et al. Actions as space-time shapes. Tenth IEEE International Conference on Computer Vision (ICCV'05), Volume 1, 2005.

[3] Martial Hebert et al. Efficient visual event detection using volumetric features. Tenth IEEE International Conference on Computer Vision (ICCV'05), Volume 1, 2005.

[4] Shuicheng Yan et al. Pair-activity classification by bi-trajectories analysis. IEEE Conference on Computer Vision and Pattern Recognition, 2008.

[5] Lihi Zelnik-Manor et al. Event-based analysis of video. Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR), 2001.

[6] Mubarak Shah et al. Learning, detection and representation of multi-agent events in videos. Artificial Intelligence, 2007.

[7] David A. McAllester et al. A discriminatively trained, multiscale, deformable part model. IEEE Conference on Computer Vision and Pattern Recognition, 2008.

[8] Luc Van Gool et al. You'll never walk alone: Modeling social behavior for multi-target tracking. IEEE 12th International Conference on Computer Vision, 2009.

[9] Rama Chellappa et al. Machine Recognition of Human Activities: A Survey. IEEE Transactions on Circuits and Systems for Video Technology, 2008.

[10] Jintao Li et al. Hierarchical spatio-temporal context modeling for action recognition. IEEE Conference on Computer Vision and Pattern Recognition, 2009.

[11] Mubarak Shah et al. Actions sketch: a novel action representation. IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'05), 2005.

[12] Bill Triggs et al. Histograms of oriented gradients for human detection. IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'05), 2005.

[13] Cordelia Schmid et al. Actions in context. IEEE Conference on Computer Vision and Pattern Recognition, 2009.

[14] Andrew Zisserman et al. Image Classification using Random Forests and Ferns. IEEE 11th International Conference on Computer Vision, 2007.

[15] Marshall F. Tappen et al. Learning pedestrian dynamics from the real world. IEEE 12th International Conference on Computer Vision, 2009.

[16] Jitendra Malik et al. Shape matching and object recognition using shape contexts. 3rd International Conference on Computer Science and Information Technology, 2010.

[17] Jitendra Malik et al. Poselets: Body part detectors trained using 3D human pose annotations. IEEE 12th International Conference on Computer Vision, 2009.

[18] Serge J. Belongie et al. Behavior recognition via sparse spatio-temporal features. IEEE International Workshop on Visual Surveillance and Performance Evaluation of Tracking and Surveillance, 2005.

[19] Luc Van Gool et al. A mobile vision system for robust multi-person tracking. IEEE Conference on Computer Vision and Pattern Recognition, 2008.

[20] Silvio Savarese et al. What are they doing?: Collective activity classification using spatio-temporal relationship among people. IEEE 12th International Conference on Computer Vision Workshops (ICCV Workshops), 2009.

[21] Pietro Perona et al. Hybrid models for human motion recognition. IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'05), 2005.

[22] Juan Carlos Niebles et al. A Hierarchical Model of Shape and Appearance for Human Action Classification. IEEE Conference on Computer Vision and Pattern Recognition, 2007.

[23] Stan Z. Li et al. Markov Random Field Models in Computer Vision. ECCV, 1994.

[24] Trevor Darrell et al. The pyramid match kernel: discriminative classification with sets of image features. Tenth IEEE International Conference on Computer Vision (ICCV'05), Volume 1, 2005.

[25] Dorin Comaniciu et al. Mean shift analysis and applications. Proceedings of the Seventh IEEE International Conference on Computer Vision, 1999.

[26] Juan Carlos Niebles et al. Unsupervised Learning of Human Action Categories Using Spatial-Temporal Words. BMVC, 2006.

[27] Shaogang Gong et al. Scene Segmentation for Behaviour Correlation. ECCV, 2008.

[28] Tae-Kyun Kim et al. Tensor Canonical Correlation Analysis for Action Classification. IEEE Conference on Computer Vision and Pattern Recognition, 2007.

[29] Yang Wang et al. Retrieving Actions in Group Contexts. ECCV Workshops, 2010.

[30] Cordelia Schmid et al. Beyond Bags of Features: Spatial Pyramid Matching for Recognizing Natural Scene Categories. IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'06), 2006.