Understanding hand-object manipulation by modeling the contextual relationship between actions, grasp types and object attributes

This paper proposes a novel method for understanding daily hand-object manipulation by developing computer vision-based techniques. Specifically, we focus on recognizing hand grasp types, object attributes, and manipulation actions within a unified framework by exploring their contextual relationships. Our hypothesis is that hands, objects, and actions must be modeled jointly in order to accurately recognize the multiple tasks that are correlated with each other in hand-object manipulation. In the proposed model, we explore various semantic relationships between actions, grasp types, and object attributes, and show how this context can be used to boost the recognition of each component. We also exploit the spatial relationship between the hand and the object in order to detect the manipulated object in cluttered environments. Experimental results on all three recognition tasks show that our proposed method outperforms traditional appearance-based methods, which are not designed to account for the contextual relationships involved in hand-object manipulation. A visualization and generalizability study of the learned context further supports our hypothesis.
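To make the idea of contextual rescoring concrete, the following is a minimal sketch, not the authors' model: three independent appearance-based classifiers produce scores for action, grasp type, and object attribute, and pairwise compatibility tables (which in practice would be learned from co-occurrence statistics in training data) rescore all label triples jointly, as in a fully connected three-node factor graph. All names, label-set sizes, and the random stand-in scores below are illustrative assumptions.

```python
# Sketch: joint contextual rescoring of per-task predictions (illustrative only).
import numpy as np

rng = np.random.default_rng(0)
n_actions, n_grasps, n_attrs = 4, 5, 3

# Stand-ins for appearance-based classifier outputs (e.g., CNN softmax scores).
p_action = rng.dirichlet(np.ones(n_actions))
p_grasp = rng.dirichlet(np.ones(n_grasps))
p_attr = rng.dirichlet(np.ones(n_attrs))

# Pairwise compatibility tables encoding semantic context, e.g. how often a
# grasp type co-occurs with an action or an object attribute. Random here;
# learned from data in a real system.
C_ag = rng.random((n_actions, n_grasps))   # action-grasp compatibility
C_ao = rng.random((n_actions, n_attrs))    # action-attribute compatibility
C_go = rng.random((n_grasps, n_attrs))     # grasp-attribute compatibility

# Joint score over all (action, grasp, attribute) triples: unary appearance
# scores multiplied by the pairwise context terms, via broadcasting.
joint = (p_action[:, None, None] * p_grasp[None, :, None] * p_attr[None, None, :]
         * C_ag[:, :, None] * C_ao[:, None, :] * C_go[None, :, :])

# The jointly most compatible triple can differ from the per-task argmaxes,
# which is exactly how context "boosts" each component's recognition.
a, g, o = np.unravel_index(joint.argmax(), joint.shape)
print(f"action={a}, grasp={g}, attribute={o}")
```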
