Close Human Interaction Recognition Using Patch-Aware Models

This paper addresses the problem of recognizing human interactions involving close physical contact from videos. Due to ambiguities in feature-to-person assignments and frequent occlusions during close interactions, it is difficult to accurately extract the individual interacting people, which degrades recognition performance. We therefore propose a hierarchical model that recognizes close interactions and simultaneously infers the supporting regions of each interacting individual. Our model associates a set of hidden variables with spatiotemporal patches and discriminatively infers their states, which indicate the person each patch belongs to. This patch-aware representation explicitly models discriminative supporting regions for each individual and thus overcomes the ambiguity in feature assignments. Moreover, we incorporate a prior over the patches to handle the frequent occlusions that arise during interactions. Using the inferred supporting regions, our model builds cleaner features for individual action recognition and interaction recognition. Extensive experiments on the BIT-Interaction data set and on sets #1 and #2 of the UT-Interaction data set validate the effectiveness of our approach.
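To make the patch-aware idea concrete, the following minimal Python sketch is our own illustration, not the authors' formulation or code: the function names, the linear per-person scoring, and the simple log-prior term are all assumptions. It shows how per-patch hidden variables could be inferred to assign each spatiotemporal patch to one of two interacting people or to the background, and how the resulting supporting patches could be pooled into cleaner per-person features.

```python
# Hypothetical sketch of latent patch-to-person assignment (not the authors' code).
# Assumptions: each spatiotemporal patch has a d-dim descriptor; person-specific
# weight vectors w_a, w_b and a background weight w_bg are available (e.g., from a
# latent-SVM-style training procedure); a simple spatial prior down-weights patches
# far from each person's detected bounding box, standing in for the occlusion prior.
import numpy as np

def infer_patch_assignments(patch_feats, w_a, w_b, w_bg, prior_a, prior_b):
    """Assign each patch to person A, person B, or background.

    patch_feats : (n_patches, d) array of patch descriptors
    w_a, w_b, w_bg : (d,) discriminative weight vectors
    prior_a, prior_b : (n_patches,) spatial priors in (0, 1]
    Returns labels in {0: person A, 1: person B, 2: background} and pooled
    per-person features.
    """
    scores = np.stack([
        patch_feats @ w_a + np.log(prior_a + 1e-8),  # person A evidence + prior
        patch_feats @ w_b + np.log(prior_b + 1e-8),  # person B evidence + prior
        patch_feats @ w_bg,                          # background
    ], axis=1)
    labels = scores.argmax(axis=1)

    # Pool supporting patches into cleaner per-person features.
    feat_a = patch_feats[labels == 0].sum(axis=0)
    feat_b = patch_feats[labels == 1].sum(axis=0)
    return labels, feat_a, feat_b

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.normal(size=(50, 16))            # 50 patches, 16-dim descriptors
    w_a, w_b, w_bg = rng.normal(size=(3, 16))
    p_a = rng.uniform(size=50)               # toy spatial priors
    p_b = rng.uniform(size=50)
    labels, fa, fb = infer_patch_assignments(X, w_a, w_b, w_bg, p_a, p_b)
    print(labels[:10], fa.shape, fb.shape)
```

In this toy version the assignment is made independently per patch; the paper's hierarchical model would instead infer the hidden states jointly with the action and interaction labels.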
