Mastering Chess and Shogi by Self-Play with a General Reinforcement Learning Algorithm

The game of chess is the most widely studied domain in the history of artificial intelligence. The strongest programs are based on a combination of sophisticated search techniques, domain-specific adaptations, and handcrafted evaluation functions that have been refined by human experts over several decades. In contrast, the AlphaGo Zero program recently achieved superhuman performance in the game of Go, by tabula rasa reinforcement learning from games of self-play. In this paper, we generalise this approach into a single AlphaZero algorithm that can achieve, tabula rasa, superhuman performance in many challenging domains. Starting from random play, and given no domain knowledge except the game rules, AlphaZero achieved within 24 hours a superhuman level of play in the games of chess and shogi (Japanese chess) as well as Go, and convincingly defeated a world-champion program in each case.
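
For readers who want a concrete picture of the self-play procedure the abstract summarises, the following is a minimal, illustrative Python sketch of an AlphaZero-style loop: a PUCT-guided Monte Carlo tree search produces a visit-count policy at each position, moves are sampled from that policy, and the finished game yields (position, search policy, outcome) training targets. This is not DeepMind's implementation; the `Nim` toy game and `UniformNet` classes are hypothetical stand-ins for the real board representations and deep residual network, and details such as Dirichlet root noise, move temperature, and the network-training step are omitted.

```python
import math
import random

class Node:
    """One search-tree node: prior probability, visit count, value sum."""
    def __init__(self, prior):
        self.prior = prior
        self.visits = 0
        self.value_sum = 0.0
        self.children = {}  # move -> Node

    def value(self):
        return self.value_sum / self.visits if self.visits else 0.0

def ucb_score(parent, child, c_puct=1.5):
    # PUCT rule: mean value plus an exploration bonus that scales with
    # the prior and shrinks as the child accumulates visits.
    explore = c_puct * child.prior * math.sqrt(parent.visits) / (1 + child.visits)
    return child.value() + explore

def expand(node, state, net):
    """Attach children from the net's policy priors; return its value estimate."""
    priors, value = net.evaluate(state)
    for move, p in priors.items():
        node.children[move] = Node(prior=p)
    return value

def mcts(root_state, net, num_simulations=100):
    """Search from root_state; return the visit-count move distribution."""
    root = Node(prior=1.0)
    expand(root, root_state, net)
    for _ in range(num_simulations):
        node, state, path = root, root_state.clone(), [root]
        while node.children:  # select down to a leaf via PUCT
            move, node = max(node.children.items(),
                             key=lambda mc: ucb_score(path[-1], mc[1]))
            state.apply(move)
            path.append(node)
        value = state.result() if state.is_terminal() else expand(node, state, net)
        for n in reversed(path):  # back up, flipping perspective each ply
            value = -value
            n.visits += 1
            n.value_sum += value
    return {m: c.visits / root.visits for m, c in root.children.items()}

def self_play_game(state, net):
    """Play one game; return (position, search policy, outcome) training triples."""
    history = []
    while not state.is_terminal():
        pi = mcts(state, net)
        history.append((state.clone(), pi))
        moves, probs = zip(*pi.items())
        state.apply(random.choices(moves, weights=probs)[0])
    z, examples = state.result(), []  # outcome for the terminal player to move
    for s, pi in reversed(history):
        z = -z  # relabel so each outcome is from that position's mover's view
        examples.append((s, pi, z))
    return examples[::-1]

class Nim:
    """Hypothetical toy game (take 1-3 stones; last stone wins) so the sketch runs."""
    def __init__(self, stones=10):
        self.stones = stones
    def clone(self):
        return Nim(self.stones)
    def apply(self, move):
        self.stones -= move
    def is_terminal(self):
        return self.stones == 0
    def result(self):
        # The opponent just took the last stone, so the player to move lost.
        return -1.0

class UniformNet:
    """Hypothetical stand-in for the policy/value network: uniform priors, zero value."""
    def evaluate(self, state):
        moves = [m for m in (1, 2, 3) if m <= state.stones]
        return {m: 1.0 / len(moves) for m in moves}, 0.0

if __name__ == "__main__":
    for s, pi, z in self_play_game(Nim(), UniformNet()):
        print(s.stones, {m: round(p, 2) for m, p in pi.items()}, z)
```

In the full algorithm, the collected triples would then train the network, minimising the error between the predicted value and the game outcome and the cross-entropy between the predicted policy and the search's visit distribution, and the improved network would guide the next round of self-play.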
