Recognizing Human-Object Interactions in Still Images by Modeling the Mutual Context of Objects and Human Poses

Detecting objects in cluttered scenes and estimating articulated human body parts from 2D images are two challenging problems in computer vision. The difficulty is particularly pronounced in activities involving human-object interactions (e.g., playing tennis), where the relevant objects tend to be small or only partially visible and the human body parts are often self-occluded. We observe, however, that objects and human poses can serve as mutual context for each other: recognizing one facilitates the recognition of the other. In this paper, we propose a mutual context model to jointly model objects and human poses in human-object interaction activities. In our approach, object detection provides a strong prior for better human pose estimation, while human pose estimation improves the accuracy of detecting the objects that interact with the human. On a six-class sports data set and a 24-class data set of people interacting with musical instruments, we show that our mutual context model outperforms the state of the art in detecting very difficult objects and estimating human poses, as well as in classifying human-object interaction activities.
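The idea of mutual context can be illustrated with a toy sketch. This is not the model proposed in the paper; it only assumes that each object candidate and each pose candidate has a standalone detector score and that a pairwise compatibility term rewards plausible object-pose combinations. All names and numbers below (`object_scores`, `pose_scores`, `compatibility`) are hypothetical and chosen for illustration.

```python
from itertools import product

def independent_argmax(object_scores, pose_scores):
    """Pick the best object and the best pose separately (no context)."""
    best_obj = max(range(len(object_scores)), key=object_scores.__getitem__)
    best_pose = max(range(len(pose_scores)), key=pose_scores.__getitem__)
    return best_obj, best_pose

def joint_argmax(object_scores, pose_scores, compatibility):
    """Pick the (object, pose) pair maximizing unary scores plus compatibility."""
    pairs = product(range(len(object_scores)), range(len(pose_scores)))
    return max(
        pairs,
        key=lambda op: object_scores[op[0]]
        + pose_scores[op[1]]
        + compatibility[op[0]][op[1]],
    )

# Hypothetical scores: object 0 is a strong false positive in background
# clutter, object 1 is the true (small, partly occluded) racket.
object_scores = [0.9, 0.6]
# Hypothetical scores: pose 0 has the arm down, pose 1 has the arm raised.
pose_scores = [0.8, 0.7]
# Assumed compatibility: only (racket, arm raised) is a plausible interaction.
compatibility = [[0.0, 0.0],
                 [0.0, 0.5]]

separate = independent_argmax(object_scores, pose_scores)
joint = joint_argmax(object_scores, pose_scores, compatibility)
print("independent:", separate)
print("joint:", joint)
```

With these made-up numbers, the independent choice keeps the clutter detection and the arm-down pose, while the joint choice switches to the weaker racket detection and the arm-raised pose because the two support each other.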
