Emergent Tool Use From Multi-Agent Autocurricula

Through multi-agent competition, the simple objective of hide-and-seek, and standard reinforcement learning algorithms at scale, we find that agents create a self-supervised autocurriculum inducing multiple distinct rounds of emergent strategy, many of which require sophisticated tool use and coordination. We find clear evidence of six emergent phases in agent strategy in our environment, each of which creates a new pressure for the opposing team to adapt; for instance, agents learn to build multi-object shelters using movable boxes, which in turn leads to agents discovering that they can overcome obstacles using ramps. We further provide evidence that multi-agent competition may scale better with increasing environment complexity and leads to behavior that centers around far more human-relevant skills than other self-supervised reinforcement learning methods such as intrinsic motivation. Finally, we propose transfer and fine-tuning as a way to quantitatively evaluate targeted capabilities, and we compare hide-and-seek agents to both intrinsic motivation and random initialization baselines in a suite of domain-specific intelligence tests.
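As a rough illustration of the competitive objective described above, the following Python sketch computes a zero-sum team reward for one timestep: hiders are rewarded only when no hider is visible to any seeker, and seekers receive the negative. This is a minimal sketch, not the authors' implementation; the Agent class, the in_view visibility test, and the distance threshold are hypothetical simplifications (a real environment would also model occlusion by walls and boxes).

    # Minimal sketch (assumed, not the paper's code) of a zero-sum
    # hide-and-seek team reward.
    from dataclasses import dataclass
    from typing import List, Tuple

    @dataclass
    class Agent:
        x: float
        y: float
        team: str  # "hider" or "seeker"

    def in_view(seeker: Agent, hider: Agent, max_dist: float = 10.0) -> bool:
        """Toy visibility test: a hider is 'seen' if it lies within max_dist.
        A real environment would additionally check line-of-sight occlusion."""
        return (seeker.x - hider.x) ** 2 + (seeker.y - hider.y) ** 2 <= max_dist ** 2

    def team_rewards(agents: List[Agent]) -> Tuple[float, float]:
        """Return (hider_reward, seeker_reward) for one timestep.
        Hiders get +1 only if no hider is visible to any seeker, else -1;
        seekers get the exact negative, making the game zero-sum."""
        hiders = [a for a in agents if a.team == "hider"]
        seekers = [a for a in agents if a.team == "seeker"]
        any_seen = any(in_view(s, h) for s in seekers for h in hiders)
        hider_reward = -1.0 if any_seen else 1.0
        return hider_reward, -hider_reward

    if __name__ == "__main__":
        agents = [Agent(0.0, 0.0, "hider"), Agent(20.0, 0.0, "seeker")]
        print(team_rewards(agents))  # hider out of range -> (1.0, -1.0)

Because each team's reward depends only on whether the hiders are hidden, any strategy one team discovers (shelter construction, ramp use) directly changes the reward landscape for the other, which is the pressure driving the autocurriculum.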
