Generalization Through Hand-Eye Coordination: An Action Space for Learning Spatially-Invariant Visuomotor Control

Imitation Learning (IL) is an effective framework for learning visuomotor skills from offline demonstration data. However, IL methods often fail to generalize to scene configurations not covered by the training data. Humans, in contrast, can manipulate objects under widely varying conditions. Key to this capability is hand-eye coordination, a cognitive ability that lets humans adaptively direct their movements at task-relevant objects while remaining invariant to the objects' absolute spatial locations. In this work, we present a learnable action space, Hand-eye Action Networks (HAN), that learns coordinated hand-eye movements from human teleoperated demonstrations. Through a set of challenging multi-stage manipulation tasks, we show that a visuomotor policy equipped with HAN inherits the key spatial-invariance property of hand-eye coordination and generalizes to new scene configurations. Additional materials are available at https://sites.google.com/stanford.edu/han
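
To make the idea concrete, the sketch below shows one way a factorized hand-eye action head could be written in PyTorch: an "eye" that attends to a task-relevant image location via a spatial softmax, and a "hand" action predicted relative to that gaze point, so the output does not depend on the object's absolute position in the scene. The module name, network shapes, and spatial-softmax gaze head are illustrative assumptions, not the authors' HAN implementation.

```python
# Minimal illustrative sketch (assumed structure, not the published HAN code):
# the action factorizes into a gaze location ("eye") and a hand action
# expressed relative to that gaze point.
import torch
import torch.nn as nn
import torch.nn.functional as F


class HandEyeActionHead(nn.Module):
    def __init__(self, in_channels: int = 3, feat_dim: int = 32, act_dim: int = 7):
        super().__init__()
        # Small conv encoder over the camera image.
        self.encoder = nn.Sequential(
            nn.Conv2d(in_channels, feat_dim, 5, stride=2, padding=2), nn.ReLU(),
            nn.Conv2d(feat_dim, feat_dim, 3, stride=2, padding=1), nn.ReLU(),
        )
        # Single-channel logits for a spatial softmax ("eye" / gaze head).
        self.gaze_logits = nn.Conv2d(feat_dim, 1, 1)
        # MLP predicting a hand action *relative* to the gaze point; the
        # relative parameterization is what gives spatial invariance.
        self.hand = nn.Sequential(
            nn.Linear(feat_dim, 64), nn.ReLU(),
            nn.Linear(64, act_dim),
        )

    def forward(self, image: torch.Tensor):
        feats = self.encoder(image)                      # (B, C, H', W')
        logits = self.gaze_logits(feats)                 # (B, 1, H', W')
        b, _, h, w = logits.shape
        attn = F.softmax(logits.view(b, -1), dim=-1).view(b, 1, h, w)
        # Expected gaze location in normalized [0, 1] image coordinates.
        ys = torch.linspace(0, 1, h, device=image.device).view(1, 1, h, 1)
        xs = torch.linspace(0, 1, w, device=image.device).view(1, 1, 1, w)
        gaze = torch.stack(
            [(attn * xs).sum(dim=(1, 2, 3)), (attn * ys).sum(dim=(1, 2, 3))],
            dim=-1,
        )                                                # (B, 2)
        # Pool features under the attention map; predict the gaze-relative action.
        pooled = (feats * attn).sum(dim=(2, 3))          # (B, C)
        rel_hand_action = self.hand(pooled)              # (B, act_dim)
        return gaze, rel_hand_action


# Usage: one forward pass on a dummy 128x128 RGB frame.
policy = HandEyeActionHead()
gaze, action = policy(torch.randn(1, 3, 128, 128))
```

Because the hand action is expressed in the frame of the predicted gaze point rather than in absolute coordinates, translating the object in the scene changes the gaze output but leaves the learned hand action largely unchanged, which is the spatial-invariance property the paper targets.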
