Joint classification of actions and object state changes with a latent variable discriminative model

We present a technique for classifying human actions that involve object manipulation. Our focus is on accurately distinguishing between related actions whose essential differences lie in the state changes they induce in the manipulated object. Our algorithm uses a latent-variable conditional random field that models the spatio-temporal relationships between human motion and the corresponding object state changes. We adopt a factored representation that better captures the causal effects of human actions on object states. Incorporating this structure enables more accurate classification of activities, which could allow robots to reason about interaction and to learn using a high-level vocabulary that captures the phenomena of interest. We present experiments on human action recognition showing that our factored representation achieves superior performance compared to alternative flat representations.
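For context, a latent-variable (hidden) conditional random field of the kind described typically defines the class posterior by marginalizing over a sequence of latent states h; the feature function \Phi and parameters \theta below are generic placeholders for illustration, not the paper's exact factored formulation:

P(y \mid x; \theta) \;=\; \sum_{h} P(y, h \mid x; \theta) \;=\; \frac{\sum_{h} \exp\!\big(\theta^{\top} \Phi(y, h, x)\big)}{\sum_{y'} \sum_{h} \exp\!\big(\theta^{\top} \Phi(y', h, x)\big)}

L(\theta) \;=\; \sum_{i} \log P(y_i \mid x_i; \theta) \;-\; \frac{\lVert \theta \rVert^{2}}{2\sigma^{2}}

Training maximizes this regularized conditional log-likelihood, commonly with a quasi-Newton method such as L-BFGS; in a factored model of the sort proposed here, \Phi would decompose into separate terms linking the latent states to the human-motion observations and to the object-state observations.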
