Dota 2 with Large Scale Deep Reinforcement Learning

On April 13th, 2019, OpenAI Five became the first AI system to defeat the world champions at an esports game. The game of Dota 2 presents novel challenges for AI systems, such as long time horizons, imperfect information, and complex, continuous state-action spaces, challenges that will become increasingly central to more capable AI systems. OpenAI Five leveraged existing reinforcement learning techniques, scaled to learn from batches of approximately 2 million frames every 2 seconds. We developed a distributed training system and tools for continual training which allowed us to train OpenAI Five for 10 months. By defeating the Dota 2 world champions (Team OG), OpenAI Five demonstrates that self-play reinforcement learning can achieve superhuman performance on a difficult task.
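The abstract does not spell out the optimization details, but the full paper describes a PPO-style policy-gradient setup scaled to very large batches. As a rough, illustrative sketch of that kind of update, the snippet below computes a clipped surrogate loss over a batch of frames; the function name, batch fields, and clip_eps value are assumptions for illustration, not the paper's actual implementation.

```python
# Minimal sketch of a clipped PPO-style policy update, the kind of
# "existing reinforcement learning technique" scaled up in this work.
# All names (ppo_surrogate_loss, batch fields, clip_eps) are illustrative
# assumptions, not taken from the paper's code.
import numpy as np

def ppo_surrogate_loss(logp_new, logp_old, advantages, clip_eps=0.2):
    """Clipped surrogate objective averaged over a batch of frames."""
    ratio = np.exp(logp_new - logp_old)  # pi_new(a|s) / pi_old(a|s)
    unclipped = ratio * advantages
    clipped = np.clip(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    # PPO maximizes the minimum of the two terms; return a loss to minimize.
    return -np.mean(np.minimum(unclipped, clipped))

# Toy batch: in the real system each optimizer step consumes on the order of
# ~2 million frames gathered from self-play rollouts every 2 seconds.
rng = np.random.default_rng(0)
loss = ppo_surrogate_loss(
    logp_new=rng.normal(-1.0, 0.1, size=1024),
    logp_old=rng.normal(-1.0, 0.1, size=1024),
    advantages=rng.normal(0.0, 1.0, size=1024),
)
print(loss)
```

In a distributed setup like the one described, many rollout workers would fill such batches from self-play games while a pool of optimizer machines applies this update in parallel; that division of labor is the scaling story the abstract summarizes.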
