Temporal-difference search in computer Go

Temporal-difference learning is one of the most successful and broadly applied solutions to the reinforcement learning problem; it has been used to achieve master-level play in chess, checkers and backgammon. The key idea is to update a value function from episodes of real experience, bootstrapping from future value estimates and using value function approximation to generalise between related states. Monte-Carlo tree search is a recent algorithm for high-performance search that has been used to achieve master-level play in Go. The key idea is to use the mean outcome of simulated episodes of experience to evaluate each state in a search tree. We introduce a new approach to high-performance search in Markov decision processes and two-player games. Our method, temporal-difference search, combines temporal-difference learning with simulation-based search. As in Monte-Carlo tree search, the value function is updated from simulated experience; but as in temporal-difference learning, it uses value function approximation and bootstrapping to generalise efficiently between related states. We apply temporal-difference search to the game of 9×9 Go, using a million binary features matching simple patterns of stones. Without any explicit search tree, our approach outperformed an unenhanced Monte-Carlo tree search with the same number of simulations. When combined with a simple alpha-beta search, our program also outperformed all traditional (pre-Monte-Carlo) search and machine learning programs on the 9×9 Computer Go Server.
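To make the core idea concrete, the following is a minimal sketch of temporal-difference search in Python. It is not the paper's implementation: it treats the problem as a single-agent episodic MDP and omits the self-play details needed for two-player Go, and the `simulator` interface (`legal_moves`, `next_state`, `is_terminal`, `outcome`), the `features` function returning a state's active binary features, and all parameter values are hypothetical stand-ins. The sketch shows the essential combination the abstract describes: a linear value function over binary features, updated by TD(0) backups (bootstrapping from the next state's estimate rather than the final Monte-Carlo return) from episodes simulated under an epsilon-greedy policy.

```python
import random
from collections import defaultdict

def td_search(root_state, simulator, features, num_simulations=1000,
              alpha=0.1, epsilon=0.1):
    """Sketch of temporal-difference search: learn a linear value
    function over binary features from simulated episodes, then act
    greedily at the root. Interfaces and parameters are illustrative."""
    weights = defaultdict(float)  # one weight per binary feature

    def value(state):
        # Linear value function: sum of weights of active features.
        return sum(weights[f] for f in features(state))

    for _ in range(num_simulations):
        state = root_state
        while not simulator.is_terminal(state):
            moves = simulator.legal_moves(state)
            # Epsilon-greedy simulation policy over the learned values.
            if random.random() < epsilon:
                move = random.choice(moves)
            else:
                move = max(moves,
                           key=lambda m: value(simulator.next_state(state, m)))
            next_state = simulator.next_state(state, move)
            # TD(0) backup: bootstrap from the next state's value estimate
            # (or the episode outcome at terminal states).
            if simulator.is_terminal(next_state):
                target = simulator.outcome(next_state)
            else:
                target = value(next_state)
            delta = target - value(state)
            for f in features(state):
                weights[f] += alpha * delta
            state = next_state

    # Choose the root move greedily with respect to the learned values.
    return max(simulator.legal_moves(root_state),
               key=lambda m: value(simulator.next_state(root_state, m)))
```

Because the value function generalises across states sharing features, every simulation improves the evaluation of many related positions at once, rather than only the states stored in an explicit search tree.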
