Temporal-difference search in computer Go

Temporal-difference learning is one of the most successful and broadly applied solutions to the reinforcement learning problem; it has been used to achieve master-level play in chess, checkers and backgammon. The key idea is to update a value function from episodes of real experience, bootstrapping from future value estimates and using value function approximation to generalise between related states. Monte-Carlo tree search is a recent algorithm for high-performance search that has been used to achieve master-level play in Go. The key idea is to use the mean outcome of simulated episodes of experience to evaluate each state in a search tree. We introduce a new approach to high-performance search in Markov decision processes and two-player games. Our method, temporal-difference search, combines temporal-difference learning with simulation-based search. As in Monte-Carlo tree search, the value function is updated from simulated experience; but as in temporal-difference learning, it uses value function approximation and bootstrapping to generalise efficiently between related states. We apply temporal-difference search to the game of 9×9 Go, using a million binary features matching simple patterns of stones. Without any explicit search tree, our approach outperformed an unenhanced Monte-Carlo tree search with the same number of simulations. When combined with a simple alpha-beta search, our program also outperformed all traditional (pre-Monte-Carlo) search and machine learning programs on the 9×9 Computer Go Server.
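To make the core idea concrete, the following is a minimal sketch of temporal-difference search in Python. It is not the paper's implementation: it treats the problem as a single-agent episodic MDP and omits the self-play details needed for two-player Go, and the `simulator` interface (`legal_moves`, `next_state`, `is_terminal`, `outcome`), the `features` function returning a state's active binary features, and all parameter values are hypothetical stand-ins. The sketch shows the essential combination the abstract describes: a linear value function over binary features, updated by TD(0) backups (bootstrapping from the next state's estimate rather than the final Monte-Carlo return) from episodes simulated under an epsilon-greedy policy.

```python
import random
from collections import defaultdict

def td_search(root_state, simulator, features, num_simulations=1000,
              alpha=0.1, epsilon=0.1):
    """Sketch of temporal-difference search: learn a linear value
    function over binary features from simulated episodes, then act
    greedily at the root. Interfaces and parameters are illustrative."""
    weights = defaultdict(float)  # one weight per binary feature

    def value(state):
        # Linear value function: sum of weights of active features.
        return sum(weights[f] for f in features(state))

    for _ in range(num_simulations):
        state = root_state
        while not simulator.is_terminal(state):
            moves = simulator.legal_moves(state)
            # Epsilon-greedy simulation policy over the learned values.
            if random.random() < epsilon:
                move = random.choice(moves)
            else:
                move = max(moves,
                           key=lambda m: value(simulator.next_state(state, m)))
            next_state = simulator.next_state(state, move)
            # TD(0) backup: bootstrap from the next state's value estimate
            # (or the episode outcome at terminal states).
            if simulator.is_terminal(next_state):
                target = simulator.outcome(next_state)
            else:
                target = value(next_state)
            delta = target - value(state)
            for f in features(state):
                weights[f] += alpha * delta
            state = next_state

    # Choose the root move greedily with respect to the learned values.
    return max(simulator.legal_moves(root_state),
               key=lambda m: value(simulator.next_state(root_state, m)))
```

Because the value function generalises across states sharing features, every simulation improves the evaluation of many related positions at once, rather than only the states stored in an explicit search tree.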
