Emergent Complexity and Zero-shot Transfer via Unsupervised Environment Design

A wide range of reinforcement learning (RL) problems - including robustness, transfer learning, unsupervised RL, and emergent complexity - require specifying a distribution of tasks or environments in which a policy will be trained. However, creating a useful distribution of environments is error-prone and takes significant developer time and effort. We propose Unsupervised Environment Design (UED) as an alternative paradigm, where developers provide environments with unknown parameters, and these parameters are used to automatically produce a distribution over valid, solvable environments. Existing approaches to automatically generating environments suffer from common failure modes: domain randomization cannot generate structure or adapt the difficulty of the environment to the agent's learning progress, and minimax adversarial training leads to worst-case environments that are often unsolvable. To generate structured, solvable environments for our protagonist agent, we introduce a second, antagonist agent that is allied with the environment-generating adversary. The adversary is motivated to generate environments which maximize regret, defined as the difference between the protagonist's and the antagonist's returns. We call our technique Protagonist Antagonist Induced Regret Environment Design (PAIRED). Our experiments demonstrate that PAIRED produces a natural curriculum of increasingly complex environments, and PAIRED agents achieve higher zero-shot transfer performance when tested in highly novel environments.
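
As a rough illustration of the regret objective described above, the sketch below estimates regret on a single adversary-proposed environment as the gap between the antagonist's and the protagonist's average returns. The rollout_return helper and the agent objects are hypothetical placeholders introduced for illustration, not the paper's implementation.

```python
import numpy as np

def paired_regret(env_params, protagonist, antagonist, rollout_return, n_episodes=10):
    """Monte Carlo estimate of regret on one adversary-generated environment.

    Regret is the antagonist's average return minus the protagonist's.
    The adversary and antagonist are trained to maximize this quantity,
    while the protagonist minimizes it by maximizing its own return.
    (Illustrative sketch only; `rollout_return(env_params, agent)` is a
    hypothetical callable that runs one episode and returns its return.)
    """
    protagonist_return = np.mean(
        [rollout_return(env_params, protagonist) for _ in range(n_episodes)]
    )
    antagonist_return = np.mean(
        [rollout_return(env_params, antagonist) for _ in range(n_episodes)]
    )
    return antagonist_return - protagonist_return
```

Under this objective, an environment that neither agent can solve yields near-zero return for both agents and hence near-zero regret, so the adversary is pushed toward environments that are challenging for the protagonist yet still solvable by the antagonist.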
