Human-level performance in first-person multiplayer games with population-based deep reinforcement learning

Recent progress in artificial intelligence through reinforcement learning (RL) has shown great success on increasingly complex single-agent environments (30, 40, 45, 46, 56) and two-player turn-based games (47, 58, 66). However, the real world contains multiple agents, each learning and acting independently to cooperate and compete with other agents, and environments reflecting this degree of complexity remain an open challenge. In this work, we demonstrate for the first time that an agent can achieve human-level performance in a popular 3D multiplayer first-person video game, Quake III Arena Capture the Flag (28), using only pixels and game points as input. These results were achieved by a novel two-tier optimisation process in which a population of independent RL agents are trained concurrently from thousands of parallel matches, with agents playing in teams together and against each other on randomly generated environments. Each agent in the population learns its own internal reward signal to complement the sparse delayed reward from winning, and selects actions using a novel temporally hierarchical representation that enables the agent to reason at multiple timescales. During game-play, these agents display human-like behaviours such as navigating, following, and defending, based on a rich learned representation that is shown to encode high-level game knowledge. In an extensive tournament-style evaluation, the trained agents exceeded the win rate of strong human players both as teammates and opponents, and proved far stronger than existing state-of-the-art agents. These results demonstrate a significant jump in the capabilities of artificial agents, bringing us closer to the goal of human-level intelligence.
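
The two-tier optimisation described above can be pictured as an inner reinforcement-learning loop nested inside an outer evolutionary loop over a population. The Python sketch below is illustrative only, under simplifying assumptions: the internal reward is a weighted sum of game-point events, and the outer loop follows the exploit-and-explore pattern of population based training (see [39] below). All names (Agent, pbt_step, GAME_EVENTS, and so on) are hypothetical, not the authors' implementation.

```python
import copy
import random

# Game-point events the environment exposes; the true event set is richer.
GAME_EVENTS = ["flag_capture", "flag_pickup", "tag_opponent", "tagged"]


class Agent:
    def __init__(self):
        self.policy = {}  # stand-in for the policy network's weights
        # Outer-loop parameters: one learned internal reward weight per event.
        self.reward_weights = {e: random.uniform(-1.0, 1.0) for e in GAME_EVENTS}

    def internal_reward(self, event_counts):
        # Dense per-step reward complementing the sparse win/loss signal.
        return sum(self.reward_weights[e] * n for e, n in event_counts.items())


def rl_update(agent, trajectory):
    # Inner loop: ordinary RL (e.g. an actor-critic step) maximising the
    # agent's own internal reward over a trajectory of event counts.
    total = sum(agent.internal_reward(step) for step in trajectory)
    _ = total  # a real implementation would update agent.policy here


def pbt_step(population, win_rates):
    # Outer loop: under-performing agents copy a stronger agent's policy
    # and reward weights (exploit), then perturb the weights (explore).
    ranked = sorted(population, key=lambda a: win_rates[a], reverse=True)
    quartile = max(1, len(ranked) // 4)
    for loser in ranked[-quartile:]:
        parent = random.choice(ranked[:quartile])
        loser.policy = copy.deepcopy(parent.policy)
        loser.reward_weights = {
            e: w * random.choice([0.8, 1.2])  # multiplicative perturbation
            for e, w in parent.reward_weights.items()
        }


# One generation: inner RL updates, match outcomes, then evolution.
population = [Agent() for _ in range(8)]
trajectory = [{e: random.randint(0, 2) for e in GAME_EVENTS} for _ in range(10)]
for agent in population:
    rl_update(agent, trajectory)
win_rates = {a: random.random() for a in population}  # stand-in for match results
pbt_step(population, win_rates)
```

The key design point in this reading is that the reward weights are treated as evolvable hyperparameters: the inner loop never differentiates through them, and the outer loop needs only match outcomes, which is what lets the sparse win/loss signal steer the learned dense reward.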

[1] Jeff Orkin et al. Three States and a Plan: The A.I. of F.E.A.R., 2006.

[2] Yuval Tassa et al. Continuous control with deep reinforcement learning, 2015, ICLR.

[3] Tom Schaul et al. FeUdal Networks for Hierarchical Reinforcement Learning, 2017, ICML.

[4] Gerald Tesauro et al. Temporal Difference Learning and TD-Gammon, 1995, J. Int. Comput. Games Assoc.

[5] Neil Immerman et al. The Complexity of Decentralized Control of Markov Decision Processes, 2000, UAI.

[6] Yoshua Bengio et al. A Recurrent Latent Variable Model for Sequential Data, 2015, NIPS.

[7] Doina Precup et al. Between MDPs and Semi-MDPs: A Framework for Temporal Abstraction in Reinforcement Learning, 1999, Artif. Intell.

[8] Doina Precup et al. The Option-Critic Architecture, 2016, AAAI.

[9] Geoffrey E. Hinton et al. Visualizing Data using t-SNE, 2008.

[10] Shimon Whiteson et al. Learning with Opponent-Learning Awareness, 2017, AAMAS.

[11] Richard L. Lewis et al. Intrinsically Motivated Reinforcement Learning: An Evolutionary Perspective, 2010, IEEE Transactions on Autonomous Mental Development.

[12] Jürgen Schmidhuber et al. Long Short-Term Memory, 1997, Neural Computation.

[13] Daan Wierstra et al. Stochastic Backpropagation and Approximate Inference in Deep Generative Models, 2014, ICML.

[14] Rob Fergus et al. Learning Multiagent Communication with Backpropagation, 2016, NIPS.

[15] Jürgen Schmidhuber et al. Learning Complex, Extended Sequences Using the Principle of History Compression, 1992, Neural Computation.

[16] Jürgen Schmidhuber et al. A Clockwork RNN, 2014, ICML.

[17] Martin A. Riedmiller et al. On Experiences in a Complex and Competitive Gaming Domain: Reinforcement Learning Meets RoboCup, 2007, IEEE Symposium on Computational Intelligence and Games.

[18] E. Hellinger. Neue Begründung der Theorie quadratischer Formen von unendlichvielen Veränderlichen, 1909.

[19] Pieter Abbeel et al. Emergence of Grounded Compositional Language in Multi-Agent Populations, 2017, AAAI.

[20] R. Quiroga. Concept cells: the building blocks of declarative memory functions, 2012, Nature Reviews Neuroscience.

[21] Max Welling et al. Auto-Encoding Variational Bayes, 2013, ICLR.

[22] David H. Ackley et al. Interactions between learning and evolution, 1991.

[23] Alex Graves et al. Asynchronous Methods for Deep Reinforcement Learning, 2016, ICML.

[24] Jeremy R. Cooperstock et al. On the Limits of the Human Motor Control Precision: The Search for a Device's Human Resolution, 2011, INTERACT.

[25] Sergio Gomez Colmenarejo et al. Hybrid computing using a neural network with dynamic external memory, 2016, Nature.

[26] Yi Wu et al. Multi-Agent Actor-Critic for Mixed Cooperative-Competitive Environments, 2017, NIPS.

[27] Ryan P. Adams et al. Mapping Sub-Second Structure in Mouse Behavior, 2015, Neuron.

[28] Shane Legg et al. IMPALA: Scalable Distributed Deep-RL with Importance Weighted Actor-Learner Architectures, 2018, ICML.

[29] Demis Hassabis et al. Mastering the game of Go with deep neural networks and tree search, 2016, Nature.

[30] Frans Mäyrä et al. Fundamental Components of the Gameplay Experience: Analysing Immersion, 2005, DiGRA Conference.

[31] Julian Togelius et al. Hierarchical controller learning in a First-Person Shooter, 2009, IEEE Symposium on Computational Intelligence and Games.

[32] Andrew Y. Ng et al. Policy Invariance Under Reward Transformations: Theory and Application to Reward Shaping, 1999, ICML.

[33] Richard L. Lewis et al. Where Do Rewards Come From?, 2009.

[34] Guillaume J. Laurent et al. Independent reinforcement learners in cooperative Markov games: a survey regarding coordination problems, 2012, The Knowledge Engineering Review.

[35] A. Elo. The rating of chessplayers, past and present, 1978.

[36] David Silver et al. Deep Reinforcement Learning from Self-Play in Imperfect-Information Games, 2016, ArXiv.

[37] Marc Toussaint et al. Learning model-free robot control by a Monte Carlo EM algorithm, 2009, Auton. Robots.

[38] David Silver et al. Reinforced Variational Inference, 2015, NIPS.

[39] Max Jaderberg et al. Population Based Training of Neural Networks, 2017, ArXiv.

[40] Kagan Tumer et al. An Introduction to Collective Intelligence, 1999, ArXiv.

[41] David Silver et al. A Unified Game-Theoretic Approach to Multiagent Reinforcement Learning, 2017, NIPS.

[42] Yoshua Bengio et al. Hierarchical Recurrent Neural Networks for Long-Term Dependencies, 1995, NIPS.

[43] Peter Stone et al. Deep Reinforcement Learning in Parameterized Action Space, 2015, ICLR.

[44] Demis Hassabis et al. Mastering the game of Go without human knowledge, 2017, Nature.

[45] Jakub W. Pachocki et al. Emergent Complexity via Multi-Agent Competition, 2017, ICLR.

[46] M. A. MacIver et al. Neuroscience Needs Behavior: Correcting a Reductionist Bias, 2017, Neuron.

[47] Sergey Levine et al. Variational Policy Search via Trajectory Optimization, 2013, NIPS.

[48] J. Pratt et al. The effects of action video game experience on the time course of inhibition of return and the efficiency of visual search, 2005, Acta Psychologica.

[49] Joel Z. Leibo et al. Multi-agent Reinforcement Learning in Sequential Social Dilemmas, 2017, AAMAS.

[50] Manuela M. Veloso et al. Layered Learning, 2000, ECML.

[51] Tom Schaul et al. Reinforcement Learning with Unsupervised Auxiliary Tasks, 2016, ICLR.

[52] Samy Bengio et al. Generating Sentences from a Continuous Space, 2015, CoNLL.

[53] Andrew Zisserman et al. Deep Inside Convolutional Networks: Visualising Image Classification Models and Saliency Maps, 2013, ICLR.

[54] Alec Radford et al. Proximal Policy Optimization Algorithms, 2017, ArXiv.

[55] Hiroaki Kitano et al. RoboCup: A Challenge Problem for AI and Robotics, 1997, RoboCup.

[56] Guillaume Lample et al. Playing FPS Games with Deep Reinforcement Learning, 2016, AAAI.

[57] Shane Legg et al. Human-level control through deep reinforcement learning, 2015, Nature.

[58] Richard K. Belew et al. New Methods for Competitive Coevolution, 1997, Evolutionary Computation.

[59] Patrick MacAlpine et al. UT Austin Villa: RoboCup 2016 3D Simulation League Competition and Technical Challenges Champions, 2015, Robot Soccer World Cup.

[60] John E. Laird et al. Human-Level AI's Killer Application: Interactive Computer Games, 2000, AI Mag.

[61] Sarit Kraus et al. Ad Hoc Autonomous Agent Teams: Collaboration without Pre-Coordination, 2010, AAAI.

[62] Ole Winther et al. Sequential Neural Models with Stochastic Layers, 2016, NIPS.

[63] C. Honey et al. Processing Timescales as an Organizing Principle for Primate Cortex, 2015, Neuron.