Robust sequence alignment for actor-object interaction recognition: Discovering actor-object states

In this paper, we address the problem of recognizing atomic human-object interactions from videos. Our method is based on the observation that, at the moment of physical contact with the object, both the motion and the appearance (i.e., shape) of the interacting person are constrained by the target object. We introduce the concept of an actor-object state: the instantaneous configuration of actor and object that usually corresponds to the moment of physical contact. We argue that the frames belonging to these actor-object states carry information that is descriptive of the specific interaction. Building on this concept, we propose an approach in which human-object interactions are represented by a combination of image patches and velocity information extracted along tracked body-point trajectories. However, determining the set of video frames corresponding to actor-object states is challenging because, before and after physical contact, human motion and appearance may vary significantly for the same interaction type. We address this issue by means of a robust sequence-matching algorithm that discovers actor-object states by matching pairs of misaligned feature sequences. We then show how the discovered actor-object states can be used to recognize basic interactions with objects. Finally, we evaluate the proposed concept on classification tasks performed on a new dataset of atomic human-object interactions.
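
The abstract does not spell out the matching algorithm itself; as a rough illustration of how two misaligned per-frame feature sequences can be brought into correspondence, the sketch below uses plain dynamic time warping. The function name dtw_align and the per-frame Euclidean distance are assumptions introduced here for illustration only; the paper's robust matching procedure may differ, for example in how it down-weights outlier frames.

```python
import numpy as np

def dtw_align(seq_a, seq_b):
    """Align two per-frame feature sequences with dynamic time warping.

    seq_a, seq_b: arrays of shape (Ta, d) and (Tb, d) holding per-frame
    feature vectors (e.g., patch descriptors concatenated with velocities).
    Returns the cumulative alignment cost and the warping path as a list
    of (i, j) frame correspondences.
    """
    Ta, Tb = len(seq_a), len(seq_b)
    cost = np.full((Ta + 1, Tb + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, Ta + 1):
        for j in range(1, Tb + 1):
            d = np.linalg.norm(seq_a[i - 1] - seq_b[j - 1])  # local frame distance
            cost[i, j] = d + min(cost[i - 1, j],       # skip a frame of seq_a
                                 cost[i, j - 1],       # skip a frame of seq_b
                                 cost[i - 1, j - 1])   # match the two frames
    # Backtrack to recover the frame-to-frame warping path.
    path, i, j = [], Ta, Tb
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        step = np.argmin([cost[i - 1, j - 1], cost[i - 1, j], cost[i, j - 1]])
        if step == 0:
            i, j = i - 1, j - 1
        elif step == 1:
            i -= 1
        else:
            j -= 1
    path.reverse()
    return cost[Ta, Tb], path
```

As a usage sketch, aligning the feature sequences of two clips of the same interaction with dtw_align yields frame correspondences; frames whose matched counterparts have consistently low local distance across many clip pairs are natural candidates for the actor-object states described above.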
