A Hierarchical Model for Human Interaction Recognition

Recognizing human interactions is a challenging task due to partially occluded body parts and motion ambiguities in interactions. We observe that the interdependencies existing at both action level and body part level greatly help disambiguate similar individual movements and facilitate human interaction recognition. In this paper, we propose a novel hierarchical model to capture such interdependencies for recognizing interactions of two persons. We model the action of each person by a large-scale global feature and several body part features. Two types of contextual information are exploited in our model to capture the implicit and complex interdependencies between interaction class, the action classes of two persons and the labels of persons' body parts. We build a challenging human interaction dataset to test our method. Results show that our model is quite effective in recognizing human interactions.

[1]  Ze-Nian Li BEYOND ACTIONS : DISCRIMINATIVE MODELS FOR CONTEXTUAL GROUP ACTIVITIES , 2010 .

[2]  Larry S. Davis,et al.  Observing Human-Object Interactions: Using Spatial and Functional Compatibility for Recognition , 2009, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[3]  Andrew McCallum,et al.  Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data , 2001, ICML.

[4]  Fei-Fei Li,et al.  Modeling mutual context of object and human pose in human-object interaction activities , 2010, 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition.

[5]  Jitendra Malik,et al.  Recognizing action at a distance , 2003, Proceedings Ninth IEEE International Conference on Computer Vision.

[6]  Shaogang Gong,et al.  Recognition of group activities using dynamic probabilistic networks , 2003, Proceedings Ninth IEEE International Conference on Computer Vision.

[7]  Serge J. Belongie,et al.  Behavior recognition via sparse spatio-temporal features , 2005, 2005 IEEE International Workshop on Visual Surveillance and Performance Evaluation of Tracking and Surveillance.

[8]  Jake K. Aggarwal,et al.  Recognition of Composite Human Activities through Context-Free Grammar Based Representation , 2006, 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'06).

[9]  Trevor Darrell,et al.  Hidden Conditional Random Fields , 2007, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[10]  Mubarak Shah,et al.  Recognizing human actions using multiple features , 2008, 2008 IEEE Conference on Computer Vision and Pattern Recognition.

[11]  Juan Carlos Niebles,et al.  Modeling Temporal Structure of Decomposable Motion Segments for Activity Classification , 2010, ECCV.

[12]  Danica Kragic,et al.  Simultaneous Visual Recognition of Manipulation Actions and Manipulated Objects , 2008, ECCV.

[13]  Greg Mori,et al.  Max-margin hidden conditional random fields for human action recognition , 2009, CVPR.

[14]  Jake K. Aggarwal,et al.  Spatio-temporal relationship match: Video structure comparison for recognition of complex human activities , 2009, 2009 IEEE 12th International Conference on Computer Vision.

[15]  Lior Wolf,et al.  A Critical View of Context , 2006, International Journal of Computer Vision.

[16]  Takeo Kanade,et al.  An Iterative Image Registration Technique with an Application to Stereo Vision , 1981, IJCAI.

[17]  Andrea Vedaldi,et al.  Objects in Context , 2007, 2007 IEEE 11th International Conference on Computer Vision.

[18]  Roman Filipovych,et al.  Recognizing primitive interactions by exploring actor-object states , 2008, 2008 IEEE Conference on Computer Vision and Pattern Recognition.

[19]  David A. McAllester,et al.  Object Detection with Discriminatively Trained Part Based Models , 2010, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[20]  Alex Pentland,et al.  A Bayesian Computer Vision System for Modeling Human Interactions , 1999, IEEE Trans. Pattern Anal. Mach. Intell..

[21]  Cordelia Schmid,et al.  Learning realistic human actions from movies , 2008, 2008 IEEE Conference on Computer Vision and Pattern Recognition.