Discovering Groups of People in Images

Understanding group activities from images is an important yet challenging task. This is because there is an exponentially large number of semantic and geometrical relationships among individuals that one must model in order to effectively recognize and localize the group activities. Rather than focusing on directly recognizing group activities as most of the previous works do, we advocate the importance of introducing an intermediate representation for modeling groups of humans which we call structure groups. Such groups define the way people spatially interact with each other. People might be facing each other to talk, while others sit on a bench side by side, and some might stand alone. In this paper we contribute a method for identifying and localizing these structured groups in a single image despite their varying viewpoints, number of participants, and occlusions. We propose to learn an ensemble of discriminative interaction patterns to encode the relationships between people in 3D and introduce a novel efficient iterative augmentation algorithm for solving this complex inference problem. A nice byproduct of the inference scheme is an approximate 3D layout estimate of the structured groups in the scene. Finally, we contribute an extremely challenging new dataset that contains images each showing multiple people performing multiple activities. Extensive evaluation confirms our theoretical findings.

[1]  Serge J. Belongie,et al.  Behavior recognition via sparse spatio-temporal features , 2005, 2005 IEEE International Workshop on Visual Surveillance and Performance Evaluation of Tracking and Surveillance.

[2]  Deva Ramanan,et al.  Detecting Actions, Poses, and Objects with Relational Phraselets , 2012, ECCV.

[3]  Silvio Savarese,et al.  A Unified Framework for Multi-target Tracking and Collective Activity Recognition , 2012, ECCV.

[4]  Ze-Nian Li BEYOND ACTIONS : DISCRIMINATIVE MODELS FOR CONTEXTUAL GROUP ACTIVITIES , 2010 .

[5]  Antonio Criminisi,et al.  TextonBoost for Image Understanding: Multi-Class Object Recognition and Segmentation by Jointly Modeling Texture, Layout, and Context , 2007, International Journal of Computer Vision.

[6]  Luc Van Gool,et al.  Improving Data Association by Joint Modeling of Pedestrian Trajectories and Groupings , 2010, ECCV.

[7]  Larry S. Davis,et al.  Combining Per-frame and Per-track Cues for Multi-person Action Recognition , 2012, ECCV.

[8]  Luc Van Gool,et al.  The Pascal Visual Object Classes (VOC) Challenge , 2010, International Journal of Computer Vision.

[9]  Jianguo Zhang,et al.  The PASCAL Visual Object Classes Challenge , 2006 .

[10]  Thomas Deselaers,et al.  ClassCut for Unsupervised Class Segmentation , 2010, ECCV.

[11]  Yi Yang,et al.  Recognizing proxemics in personal photos , 2012, 2012 IEEE Conference on Computer Vision and Pattern Recognition.

[12]  Pushmeet Kohli,et al.  Graph Cut Based Inference with Co-occurrence Statistics , 2010, ECCV.

[13]  Bill Triggs,et al.  Histograms of oriented gradients for human detection , 2005, 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'05).

[14]  Thorsten Joachims,et al.  Cutting-plane training of structural SVMs , 2009, Machine Learning.

[15]  Yang Wang,et al.  Retrieving Actions in Group Contexts , 2010, ECCV Workshops.

[16]  Nir Friedman,et al.  Probabilistic Graphical Models - Principles and Techniques , 2009 .

[17]  Jitendra Malik,et al.  Normalized cuts and image segmentation , 1997, Proceedings of IEEE Computer Society Conference on Computer Vision and Pattern Recognition.

[18]  Fernando De la Torre,et al.  Joint segmentation and classification of human actions in video , 2011, CVPR 2011.

[19]  Jiebo Luo,et al.  Recognizing realistic actions from videos “in the wild” , 2009, 2009 IEEE Conference on Computer Vision and Pattern Recognition.

[20]  Luc Van Gool,et al.  A Hough transform-based voting framework for action recognition , 2010, 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition.

[21]  David Elliott,et al.  In the Wild , 2010 .

[22]  Mohamed R. Amer,et al.  Cost-Sensitive Top-Down/Bottom-Up Inference for Multiscale Activity Recognition , 2012, ECCV.

[23]  Matthieu Guillaumin,et al.  Segmentation Propagation in ImageNet , 2012, ECCV.

[24]  Jitendra Malik,et al.  Poselets: Body part detectors trained using 3D human pose annotations , 2009, 2009 IEEE 12th International Conference on Computer Vision.

[25]  Ivan Laptev,et al.  On Space-Time Interest Points , 2003, Proceedings Ninth IEEE International Conference on Computer Vision.

[26]  Chih-Jen Lin,et al.  LIBSVM: A library for support vector machines , 2011, TIST.

[27]  Silvio Savarese,et al.  What are they doing? : Collective activity classification using spatio-temporal relationship among people , 2009, 2009 IEEE 12th International Conference on Computer Vision Workshops, ICCV Workshops.

[28]  Kristen Grauman,et al.  Efficient activity detection with max-subgraph search , 2012, 2012 IEEE Conference on Computer Vision and Pattern Recognition.

[29]  Shigeyuki Odashima,et al.  Collective Activity Localization with Contextual Spatial Pyramid , 2012, ECCV Workshops.

[30]  Juan Carlos Niebles,et al.  Unsupervised Learning of Human Action Categories , 2006 .

[31]  Alexei A. Efros,et al.  Putting Objects in Perspective , 2006, CVPR.

[32]  Kiriakos N. Kutulakos Trends and Topics in Computer Vision , 2010, Lecture Notes in Computer Science.

[33]  Fernando De la Torre,et al.  Max-Margin Early Event Detectors , 2012, 2012 IEEE Conference on Computer Vision and Pattern Recognition.

[34]  Jake K. Aggarwal,et al.  Stochastic Representation and Recognition of High-Level Group Activities , 2011, International Journal of Computer Vision.

[35]  Jake K. Aggarwal,et al.  Spatio-temporal relationship match: Video structure comparison for recognition of complex human activities , 2009, 2009 IEEE 12th International Conference on Computer Vision.

[36]  Dima Damen,et al.  Detecting Carried Objects in Short Video Sequences , 2008, ECCV.

[37]  Alexei A. Efros,et al.  Unsupervised Discovery of Mid-Level Discriminative Patches , 2012, ECCV.

[38]  Ian D. Reid,et al.  High Five: Recognising human interactions in TV shows , 2010, BMVC.

[39]  Martial Hebert,et al.  Toward Objective Evaluation of Image Segmentation Algorithms , 2007, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[40]  Vittorio Ferrari,et al.  We Are Family: Joint Pose Estimation of Multiple Persons , 2010, ECCV.

[41]  S. Savarese,et al.  Learning an Image-Based Motion Context for Multiple People Tracking , 2014, 2014 IEEE Conference on Computer Vision and Pattern Recognition.

[42]  Jiebo Luo,et al.  Recognizing realistic actions from videos , 2009, CVPR.