Simultaneous Visual Recognition of Manipulation Actions and Manipulated Objects

The visual analysis of human manipulation actions is of interest for e.g. human-robot interaction applications where a robot learns how to perform a task by watching a human. In this paper, a method for classifying manipulation actions in the context of the objects manipulated, and classifying objects in the context of the actions used to manipulate them is presented. Hand and object features are extracted from the video sequence using a segmentation based approach. A shape based representation is used for both the hand and the object. Experiments show this representation suitable for representing generic shape classes. The action-object correlation over time is then modeled using conditional random fields. Experimental comparison show great improvement in classification rate when the action-object correlation is taken into account, compared to separate classification of manipulation actions and manipulated objects.

[1]  Michael I. Mandel,et al.  Visual Hand Tracking Using Nonparametric Belief Propagation , 2004, 2004 Conference on Computer Vision and Pattern Recognition Workshop.

[2]  Patrick Pérez,et al.  Retrieving actions in movies , 2007, 2007 IEEE 11th International Conference on Computer Vision.

[3]  Stefan Carlsson,et al.  Recognizing and Tracking Human Action , 2002, ECCV.

[4]  Azriel Rosenfeld,et al.  Recognition by Functional Parts , 1995, Comput. Vis. Image Underst..

[5]  Cristian Sminchisescu,et al.  Conditional models for contextual human motion recognition , 2006, Tenth IEEE International Conference on Computer Vision (ICCV'05) Volume 1.

[6]  Narendra Ahuja,et al.  Extraction of 2D Motion Trajectories and Its Application to Hand Gesture Recognition , 2002, IEEE Trans. Pattern Anal. Mach. Intell..

[7]  Manolis I. A. Lourakis,et al.  Real-Time Tracking of Multiple Skin-Colored Objects with a Possibly Moving Camera , 2004, ECCV.

[8]  Vladimir Pavlovic,et al.  Visual Interpretation of Hand Gestures for Human-Computer Interaction: A Review , 1997, IEEE Trans. Pattern Anal. Mach. Intell..

[9]  Cordelia Schmid,et al.  Scale-invariant shape features for recognition of object categories , 2004, Proceedings of the 2004 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2004. CVPR 2004..

[10]  Antonio Torralba,et al.  Using the Forest to See the Trees: A Graphical Model Relating Features, Objects, and Scenes , 2003, NIPS.

[11]  Irfan A. Essa,et al.  Exploiting human actions and object context for recognition tasks , 1999, Proceedings of the Seventh IEEE International Conference on Computer Vision.

[12]  Lawrence R. Rabiner,et al.  A tutorial on hidden Markov models and selected applications in speech recognition , 1989, Proc. IEEE.

[13]  James M. Rehg,et al.  A Scalable Approach to Activity Recognition based on Object Use , 2007, 2007 IEEE 11th International Conference on Computer Vision.

[14]  Antonio Torralba,et al.  Contextual Priming for Object Detection , 2003, International Journal of Computer Vision.

[15]  Trevor Darrell,et al.  Fast pose estimation with parameter-sensitive hashing , 2003, Proceedings Ninth IEEE International Conference on Computer Vision.

[16]  Trevor Darrell,et al.  Hidden Conditional Random Fields , 2007, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[17]  David G. Stork,et al.  Generic object recognition using form and function , 1998, Pattern Analysis and Applications.

[18]  Nando de Freitas,et al.  A Statistical Model for General Contextual Object Recognition , 2004, ECCV.

[19]  Y. LeCun,et al.  Learning methods for generic object recognition with invariance to pose and lighting , 2004, Proceedings of the 2004 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2004. CVPR 2004..

[20]  Andrew McCallum,et al.  Dynamic conditional random fields: factorized probabilistic models for labeling and segmenting sequence data , 2004, J. Mach. Learn. Res..

[21]  Björn Stenger,et al.  Model-based hand tracking using a hierarchical Bayesian filter , 2006, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[22]  D. Stork Generic object recognition using form & function , 1996 .

[23]  Andrew McCallum,et al.  Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data , 2001, ICML.

[24]  Trevor Darrell,et al.  Latent-Dynamic Discriminative Models for Continuous Gesture Recognition , 2007, 2007 IEEE Conference on Computer Vision and Pattern Recognition.

[25]  Trevor Darrell,et al.  Learning Visual Representations using Images with Captions , 2007, 2007 IEEE Conference on Computer Vision and Pattern Recognition.

[26]  Adrian Hilton,et al.  A survey of advances in vision-based human motion capture and analysis , 2006, Comput. Vis. Image Underst..

[27]  Allan D. Jepson,et al.  Towards the computational perception of action , 1998, Proceedings. 1998 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (Cat. No.98CB36231).

[28]  Jitendra Malik,et al.  Shape matching and object recognition using shape contexts , 2010, 2010 3rd International Conference on Computer Science and Information Technology.

[29]  Antonio Torralba,et al.  Contextual Models for Object Detection Using Boosted Random Fields , 2004, NIPS.

[30]  G LoweDavid,et al.  Distinctive Image Features from Scale-Invariant Keypoints , 2004 .