Recognition of Group Activities in Videos Based on Single- and Two-Person Descriptors

Group activity recognition from videos is a challenging problem that has received comparatively little attention. We propose an activity recognition method that exploits group context. In order to encode both single-person descriptions and two-person interactions, we learn mappings from high-dimensional feature spaces to low-dimensional dictionaries. In particular, the proposed two-person descriptor takes into account geometric characteristics of the relative pose and motion between the two persons. Both single-person and two-person representations are then used to define the unary and pairwise potentials of an energy function, whose optimization leads to a structured labeling of the persons involved in the same activity. Unlike the vast majority of existing methods, the proposed method is able to recognize multiple distinct group activities occurring simultaneously in a video. The proposed method is evaluated on datasets widely used for group activity recognition and compared with several baseline methods.
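
As a rough illustration of the labeling step summarized above, the abstract suggests an energy of the following general form; the notation ($\phi$, $\psi$, $\mathbf{f}_i$, $\mathbf{f}_{ij}$, $\mathcal{P}$) is ours and is only an assumption about how the potentials are organized, not the paper's exact formulation:

$$
E(\mathbf{y}) \;=\; \sum_{i} \phi\!\big(\mathbf{f}_i,\, y_i\big) \;+\; \sum_{(i,j) \in \mathcal{P}} \psi\!\big(\mathbf{f}_{ij},\, y_i,\, y_j\big),
$$

where $\mathbf{f}_i$ would denote the single-person descriptor of person $i$, $\mathbf{f}_{ij}$ the two-person descriptor of the pair $(i,j)$, $y_i$ the activity label assigned to person $i$, and $\mathcal{P}$ the set of person pairs considered. Optimizing $E$ over the joint labeling $\mathbf{y}$ then yields the structured assignment of persons to group activities described in the abstract.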
