Asymmetric self-play for automatic goal discovery in robotic manipulation

We train a single, goal-conditioned policy that can solve many robotic manipulation tasks, including tasks with previously unseen goals and objects. We rely on asymmetric self-play for goal discovery, where two agents, Alice and Bob, play a game. Alice is asked to propose challenging goals and Bob aims to solve them. We show that this method can discover highly diverse and complex goals without any human priors. Bob can be trained with only sparse rewards, because the interaction between Alice and Bob results in a natural curriculum and Bob can learn from Alice's trajectory when it is relabeled as a goal-conditioned demonstration. Finally, our method scales, resulting in a single policy that can generalize to many unseen tasks such as setting a table, stacking blocks, and solving simple puzzles. Videos of the learned policy are available at https://robotics-self-play.github.io.
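
To make the structure of the Alice-and-Bob game concrete, the following is a minimal sketch of one self-play round. ToyEnv, random_policy, the horizon, and the 0/1 reward values are illustrative placeholders and assumptions, not the paper's actual environment or training code; the sketch only shows the episode structure: Alice's final state becomes Bob's goal, Bob receives a sparse success reward, Alice is rewarded when Bob fails, and Alice's trajectory is relabeled as a goal-conditioned demonstration when Bob cannot solve the goal.

import random

class ToyEnv:
    # Tiny 1-D stand-in environment; the state is an integer position.
    def reset(self):
        self.pos = 0
        return self.pos

    def step(self, action):  # action in {-1, 0, +1}
        self.pos += action
        return self.pos

    def goal_reached(self, state, goal):
        return state == goal


def random_policy(state, goal=None):
    # Placeholder for a learned policy; acts uniformly at random.
    return random.choice([-1, 0, 1])


def run_self_play_episode(env, alice, bob, horizon=20):
    # Alice's turn: act freely from the initial state; her final state becomes the goal.
    state = env.reset()
    alice_traj = []
    for _ in range(horizon):
        action = alice(state)
        alice_traj.append((state, action))
        state = env.step(action)
    goal = state

    # Bob's turn: start from the same initial state and try to reach Alice's goal.
    state = env.reset()
    bob_success = False
    for _ in range(horizon):
        action = bob(state, goal)
        state = env.step(action)
        if env.goal_reached(state, goal):
            bob_success = True
            break

    # Asymmetric, sparse rewards: Alice is rewarded for goals Bob fails to solve,
    # Bob only for reaching the goal (0/1 values are illustrative).
    alice_reward = 0.0 if bob_success else 1.0
    bob_reward = 1.0 if bob_success else 0.0

    # If Bob fails, relabel Alice's trajectory as a goal-conditioned demonstration
    # that Bob can imitate.
    demo = [(s, a, goal) for (s, a) in alice_traj] if not bob_success else None
    return alice_reward, bob_reward, demo


if __name__ == "__main__":
    env = ToyEnv()
    a_r, b_r, demo = run_self_play_episode(env, random_policy, random_policy)
    print("alice reward:", a_r, "bob reward:", b_r,
          "demo length:", 0 if demo is None else len(demo))

In the full system, the placeholder policies would be trained with reinforcement learning on these rewards, and the relabeled demonstrations would provide an additional imitation signal for Bob.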
