Observing Human-Object Interactions: Using Spatial and Functional Compatibility for Recognition

Interpreting images and videos of humans interacting with different objects is a daunting task. It involves understanding the scene or event, analyzing human movements, recognizing manipulable objects, and observing the effect of the human movement on those objects. While each of these perceptual tasks can be conducted independently, the recognition rate improves when the interactions between them are considered. Motivated by psychological studies of human perception, we present a Bayesian approach that integrates the various perceptual tasks involved in understanding human-object interactions. Previous approaches to object and action recognition rely on static shape or appearance feature matching and on motion analysis, respectively. Our approach goes beyond these traditional approaches by applying spatial and functional constraints to each of the perceptual elements to obtain a coherent semantic interpretation. Such constraints allow us to recognize objects and actions when their appearances are not discriminative enough. We also demonstrate the use of such constraints to recognize actions from static images, without using any motion information.
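As a rough illustration of how a functional-compatibility constraint can resolve an ambiguous appearance cue, the toy sketch below combines object evidence and action evidence in a single joint posterior. It is not the model from the paper: the factorization P(O) P(e_o | O) P(A | O) P(e_a | A), the class names, and every probability value are assumptions chosen only for illustration.

```python
from itertools import product


def joint_posterior(obj_prior, compat, obj_lik, act_lik):
    """Return P(O, A | evidence) for a toy object/action model.

    Assumed (hypothetical) factorization:
        P(O, A | e_o, e_a) is proportional to
        P(O) * P(e_o | O) * P(A | O) * P(e_a | A)
    """
    scores = {
        (o, a): obj_prior[o] * obj_lik[o] * compat[o][a] * act_lik[a]
        for o, a in product(obj_prior, act_lik)
    }
    total = sum(scores.values())
    return {pair: s / total for pair, s in scores.items()}


# All numbers below are made up for the example.
obj_prior = {"cup": 0.5, "spray_bottle": 0.5}   # P(O)
obj_lik = {"cup": 0.5, "spray_bottle": 0.5}     # P(e_o | O): appearance is ambiguous
act_lik = {"drinking": 0.8, "spraying": 0.2}    # P(e_a | A): motion favours drinking
compat = {                                      # P(A | O): functional compatibility
    "cup": {"drinking": 0.9, "spraying": 0.1},
    "spray_bottle": {"drinking": 0.1, "spraying": 0.9},
}

posterior = joint_posterior(obj_prior, compat, obj_lik, act_lik)

# Marginalize over actions to get the object posterior.
obj_marginal = {
    o: sum(p for (oo, _), p in posterior.items() if oo == o) for o in obj_prior
}
best_object = max(obj_marginal, key=obj_marginal.get)
```

Although the appearance likelihoods in this sketch are identical for both objects, the action evidence propagates through the compatibility term and shifts the object posterior toward the cup.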
