Mastering the game of Go with deep neural networks and tree search

The game of Go has long been viewed as the most challenging of classic games for artificial intelligence owing to its enormous search space and the difficulty of evaluating board positions and moves. Here we introduce a new approach to computer Go that uses ‘value networks’ to evaluate board positions and ‘policy networks’ to select moves. These deep neural networks are trained by a novel combination of supervised learning from human expert games, and reinforcement learning from games of self-play. Without any lookahead search, the neural networks play Go at the level of state-of-the-art Monte Carlo tree search programs that simulate thousands of random games of self-play. We also introduce a new search algorithm that combines Monte Carlo simulation with value and policy networks. Using this search algorithm, our program AlphaGo achieved a 99.8% winning rate against other Go programs, and defeated the human European Go champion by 5 games to 0. This is the first time that a computer program has defeated a human professional player in the full-sized game of Go, a feat previously thought to be at least a decade away.
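To make the combination of networks and search concrete, the sketch below shows, in miniature, how policy priors and value estimates can drive Monte Carlo tree search via a PUCT-style selection rule of the general form used by AlphaGo. It is an illustrative assumption-laden toy, not the paper's implementation: the game is a trivial subtraction game rather than Go, the names policy_net, value_net, and C_PUCT are hypothetical stand-ins, and the "networks" are stubbed with uniform priors and random values in place of trained deep networks.

import math
import random

# Toy game standing in for Go: a pile of stones, each player removes 1 or 2,
# and whoever takes the last stone wins. A state is just the pile size.
def legal_moves(pile):
    return [m for m in (1, 2) if m <= pile]

def apply_move(pile, move):
    return pile - move

def is_terminal(pile):
    return pile == 0

# Hypothetical stubs for the trained networks described in the paper. A real
# system would use deep convolutional networks trained by supervised and
# reinforcement learning; uniform priors and random leaf values merely keep
# the sketch self-contained.
def policy_net(state):
    moves = legal_moves(state)
    return {m: 1.0 / len(moves) for m in moves}

def value_net(state):
    return random.uniform(-1.0, 1.0)  # evaluation from the player to move

class Node:
    def __init__(self, state, prior):
        self.state = state
        self.prior = prior        # P(s, a): prior from the policy network
        self.children = {}        # move -> child Node
        self.visits = 0           # N: visit count
        self.value_sum = 0.0      # W: total backed-up value

    def q(self):                  # Q: mean value, from this node's player
        return self.value_sum / self.visits if self.visits else 0.0

C_PUCT = 1.0  # exploration constant; an assumed value, not the paper's

def select_child(node):
    # PUCT-style rule: argmax_a Q(s,a) + c * P(s,a) * sqrt(N(s)) / (1 + N(s,a)).
    # A child's Q is from the opponent's perspective, hence the negation.
    sqrt_n = math.sqrt(node.visits)
    def score(item):
        child = item[1]
        u = C_PUCT * child.prior * sqrt_n / (1 + child.visits)
        return -child.q() + u
    return max(node.children.items(), key=score)

def simulate(root):
    # Selection: walk down the tree until reaching an unexpanded node.
    path, node = [root], root
    while node.children:
        _, node = select_child(node)
        path.append(node)
    # Expansion and evaluation: priors come from the policy net and the leaf
    # value from the value net (no rollout in this sketch). A terminal
    # position scores -1 because the player to move has already lost.
    if is_terminal(node.state):
        value = -1.0
    else:
        for move, p in policy_net(node.state).items():
            node.children[move] = Node(apply_move(node.state, move), p)
        value = value_net(node.state)
    # Backup: propagate the value up the path, flipping sign each ply.
    for n in reversed(path):
        n.visits += 1
        n.value_sum += value
        value = -value

def best_move(state, n_simulations=500):
    root = Node(state, prior=1.0)
    for _ in range(n_simulations):
        simulate(root)
    # Play the most-visited move, as is standard in MCTS.
    return max(root.children.items(), key=lambda kv: kv[1].visits)[0]

if __name__ == "__main__":
    # With enough simulations the search should prefer taking 1 from a pile
    # of 7, leaving the opponent a losing pile of 6 (a multiple of 3).
    print(best_move(7))

With trained networks substituted for the stubs, this same skeleton captures the division of labour the abstract describes: the policy network focuses the search on promising moves, while the value network evaluates leaf positions in place of exhaustive lookahead.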
