DexVIP: Learning Dexterous Grasping with Human Hand Pose Priors from Video

Dexterous multi-fingered robotic hands have a formidable action space, yet their morphological similarity to the human hand holds immense potential to accelerate robot learning. We propose DexVIP, an approach to learn dexterous robotic grasping from human-object interactions present in in-the-wild YouTube videos. We do this by curating grasp images from human-object interaction videos and imposing a prior over the agent’s hand pose when learning to grasp with deep reinforcement learning. A key advantage of our method is that the learned policy is able to leverage free-form in-the-wild visual data. As a result, it can easily scale to new objects, and it sidesteps the standard practice of collecting human demonstrations in a lab, a much more expensive and indirect way to capture human expertise. Through experiments on 27 objects with a 30-DoF simulated robot hand, we demonstrate that DexVIP compares favorably to existing approaches that lack a hand pose prior or rely on specialized tele-operation equipment to obtain human demonstrations, while also being faster to train.
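To make the idea of a hand pose prior inside a reinforcement learning objective concrete, here is a minimal sketch. It is not the formulation used in the paper, which the abstract does not specify. It assumes the prior is a single target joint-angle vector (for example, one estimated from a curated grasp image) and that it enters the reward as a weighted mean-squared penalty. The function names, the penalty form, and the `weight` parameter are all hypothetical.

```python
from typing import Sequence


def pose_prior_penalty(agent_pose: Sequence[float],
                       prior_pose: Sequence[float]) -> float:
    """Mean squared joint-angle difference between the agent's hand pose
    and a prior pose. The squared-error form is an assumption made for
    illustration, not the paper's stated loss."""
    if len(agent_pose) != len(prior_pose):
        raise ValueError("poses must have the same number of joints")
    if len(agent_pose) == 0:
        raise ValueError("poses must contain at least one joint")
    squared = ((a - p) ** 2 for a, p in zip(agent_pose, prior_pose))
    return sum(squared) / len(agent_pose)


def shaped_reward(task_reward: float,
                  agent_pose: Sequence[float],
                  prior_pose: Sequence[float],
                  weight: float = 0.5) -> float:
    """Task reward minus a weighted pose-prior penalty.
    `weight` is a hypothetical trade-off coefficient."""
    return task_reward - weight * pose_prior_penalty(agent_pose, prior_pose)


if __name__ == "__main__":
    # Toy 2-joint example: the agent deviates from the prior on one joint.
    print(shaped_reward(1.0, [1.0, 0.0], [0.0, 0.0], weight=0.5))
```

In a setup like this, the agent is still rewarded for completing the grasp, and the penalty only nudges it toward human-like hand configurations. How strongly it does so depends on the assumed `weight`.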
