Humans in groups: The importance of contextual information for understanding collective activities

In this work we consider the problem of modeling and recognizing collective activities performed by groups of people sharing a common purpose. To this end, we take into account the social context of each person, expressed as the relative orientation and spatial distribution of the people around them. We propose a method that processes a video stream and, at each time instant, associates a collective activity with each individual in the scene by representing the individual (the target) as part of a group of nearby people (the target group). To generalize with respect to the viewpoint, we associate each target with a reference frame based on the target's spatial orientation, which we estimate automatically by semi-supervised learning. We then model the social context of a target by organizing a set of instantaneous descriptors, which capture the mutual positions and orientations within the target group, in a graph structure. Collective activities are classified with a multi-class SVM equipped with a novel kernel function for graphs. We report an extensive experimental analysis on benchmark datasets that validates the proposed solution and shows significant improvements over state-of-the-art results.
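As a minimal sketch of the target-centred reference frame mentioned above, the following Python snippet re-expresses the neighbours of a target in a frame whose origin is the target's position and whose x axis is the target's facing direction. It assumes that positions are 2D ground-plane coordinates and that orientations are already available as angles in radians; the function name `to_target_frame` and its parameters are hypothetical, and the snippet covers neither the semi-supervised orientation estimation nor the actual descriptors and graph kernel of the method.

```python
import numpy as np

def to_target_frame(target_xy, target_theta, others_xy, others_theta):
    """Express neighbours in a frame centred on the target and aligned with
    its orientation (hypothetical helper, not the authors' code)."""
    others_xy = np.asarray(others_xy, dtype=float).reshape(-1, 2)
    others_theta = np.asarray(others_theta, dtype=float)

    c, s = np.cos(target_theta), np.sin(target_theta)
    # Rotation by -target_theta: maps the target's facing direction onto +x.
    rot = np.array([[c, s],
                    [-s, c]])

    # Translate so the target is at the origin, then rotate (row vectors).
    rel_xy = (others_xy - np.asarray(target_xy, dtype=float)) @ rot.T

    # Orientation of each neighbour relative to the target, wrapped to [-pi, pi).
    rel_theta = (others_theta - target_theta + np.pi) % (2 * np.pi) - np.pi
    return rel_xy, rel_theta


# Target at (1, 1) facing +y; one neighbour two units ahead, facing the target.
rel_xy, rel_theta = to_target_frame(
    target_xy=(1.0, 1.0),
    target_theta=np.pi / 2,
    others_xy=[(1.0, 3.0)],
    others_theta=[-np.pi / 2],
)
```

Because both outputs depend only on quantities measured relative to the target, they stay the same when the whole scene is translated or rotated on the ground plane, which is the sense in which such a frame helps with viewpoint changes.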
