Mastering the game of Go without human knowledge

A long-standing goal of artificial intelligence is an algorithm that learns, tabula rasa, superhuman proficiency in challenging domains. Recently, AlphaGo became the first program to defeat a world champion in the game of Go. The tree search in AlphaGo evaluated positions and selected moves using deep neural networks. These neural networks were trained by supervised learning from human expert moves, and by reinforcement learning from self-play. Here we introduce an algorithm based solely on reinforcement learning, without human data, guidance or domain knowledge beyond game rules. AlphaGo becomes its own teacher: a neural network is trained to predict AlphaGo’s own move selections and also the winner of AlphaGo’s games. This neural network improves the strength of the tree search, resulting in higher quality move selection and stronger self-play in the next iteration. Starting tabula rasa, our new program AlphaGo Zero achieved superhuman performance, winning 100–0 against the previously published, champion-defeating AlphaGo.
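The training signal described above (predict the search's move selections and the eventual winner) can be sketched as a per-position loss. This is only an illustrative sketch: the abstract does not specify the loss form, so the cross-entropy and squared-error terms, their equal weighting, and the names `softmax`, `self_play_loss`, `policy_logits`, `search_probs`, and `outcome` are all assumptions introduced here.

```python
import math


def softmax(logits):
    """Convert raw move scores into a probability distribution."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def self_play_loss(policy_logits, value, search_probs, outcome):
    """Hypothetical combined loss for a single self-play position.

    policy_logits: raw network scores, one per candidate move
    value:         network's predicted game result, in [-1, 1]
    search_probs:  move distribution chosen by the tree search (target)
    outcome:       +1 if the player to move went on to win, else -1
    """
    predicted = softmax(policy_logits)
    # Cross-entropy: pushes the network toward the search's move selections.
    policy_loss = -sum(
        t * math.log(p) for t, p in zip(search_probs, predicted) if t > 0
    )
    # Squared error: pushes the predicted value toward the actual winner.
    value_loss = (outcome - value) ** 2
    return policy_loss + value_loss
```

A network whose move preferences agree with the search and whose value matches the game result receives a lower loss than one that disagrees; for example, `self_play_loss([2.0, 0.0], 1.0, [1.0, 0.0], 1.0)` is smaller than `self_play_loss([0.0, 2.0], 1.0, [1.0, 0.0], 1.0)`.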
