Replay-Guided Adversarial Environment Design

Deep reinforcement learning (RL) agents may successfully generalize to new settings if trained on an appropriately diverse set of environment and task configurations. Unsupervised Environment Design (UED) is a promising self-supervised RL paradigm, wherein the free parameters of an underspecified environment are automatically adapted during training to the agent’s capabilities, leading to the emergence of diverse training environments. Here, we cast Prioritized Level Replay (PLR), an empirically successful but theoretically unmotivated method that selectively samples randomly generated training levels, as UED. We argue that by curating completely random levels, PLR, too, can generate novel and complex levels for effective training. This insight reveals a natural class of UED methods we call Dual Curriculum Design (DCD). Crucially, DCD includes both PLR and a popular UED algorithm, PAIRED, as special cases, and inherits similar theoretical guarantees. This connection allows us to develop novel theory for PLR, providing a version with a robustness guarantee at Nash equilibria. Furthermore, our theory suggests a highly counterintuitive improvement to PLR: by stopping the agent from updating its policy on uncurated levels (that is, by training on less data), we can improve convergence to Nash equilibria. Indeed, our experiments confirm that our new method, PLR⊥, obtains better results on a suite of out-of-distribution, zero-shot transfer tasks, in addition to demonstrating that replay-guided curation improves the performance of PAIRED, from which PLR⊥ inherits its theoretical framework.

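To make the mechanism concrete, the following is a minimal Python sketch of the dual curriculum loop described above. It is an illustration under assumed interfaces, not the authors’ implementation: `agent.collect`, `agent.update`, and `regret_fn` (a regret proxy such as PLR’s positive value loss score) are hypothetical stand-ins, and the buffer capacity and replay probability are placeholder values. The defining detail of PLR⊥ sits in the exploration branch: fresh random levels are rolled out only to estimate their regret, and gradient updates happen exclusively on curated, replayed levels.

```python
import random


class LevelBuffer:
    """Fixed-capacity store of (level, regret_score) pairs.

    Keeps the highest-scoring levels seen so far and replays them with
    rank-based prioritization, loosely following PLR's replay scheme.
    """

    def __init__(self, capacity=4000):
        self.capacity = capacity
        self.entries = []  # list of (level, score), kept sorted by score

    def add(self, level, score):
        self.entries.append((level, score))
        # Keep only the `capacity` highest-regret levels.
        self.entries.sort(key=lambda e: e[1], reverse=True)
        del self.entries[self.capacity:]

    def sample(self):
        # Rank-based prioritization: higher-regret levels replay more often.
        weights = [1.0 / rank for rank in range(1, len(self.entries) + 1)]
        level, _ = random.choices(self.entries, weights=weights, k=1)[0]
        return level


def generate_random_level():
    # In PLR's original setting, a "level" is simply an integer seed
    # fed to a procedural content generator.
    return random.getrandbits(31)


def plr_perp_step(agent, buffer, regret_fn, replay_prob=0.5):
    """One iteration of a PLR⊥-style dual curriculum (illustrative sketch)."""
    if buffer.entries and random.random() < replay_prob:
        # Replay branch: train on a curated, high-regret level.
        level = buffer.sample()
        trajectory = agent.collect(level)
        agent.update(trajectory)  # the only place policy gradients flow
    else:
        # Exploration branch: score a fresh random level, but do NOT update
        # the policy on it -- PLR⊥ stops gradients on uncurated levels.
        level = generate_random_level()
        trajectory = agent.collect(level)
        buffer.add(level, regret_fn(trajectory))
```

Under this reading, the random level generator plays the role of an environment-proposing teacher, and the replay buffer’s curation supplies the adversarial pressure that PAIRED obtains from a learned adversary.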