Enhancing upper confidence bounds for trees with temporal difference values

Upper confidence bounds for trees (UCT) is one of the most popular and generally effective Monte Carlo tree search (MCTS) algorithms. In practice, however, it is relatively weak when not aided by additional enhancements, and improving its performance without sacrificing generality remains an open research challenge. We introduce a new domain-independent UCT enhancement based on the theory of reinforcement learning. Our approach estimates state values in the UCT tree by employing temporal difference (TD) learning, which is known to outperform plain Monte Carlo sampling in certain domains. We present three adaptations of the TD(λ) algorithm to UCT's tree policy and backpropagation step. Evaluations on four games (Gomoku, Hex, Connect Four, and Tic Tac Toe) show that our approach raises UCT's level of play comparably to the rapid action value estimation (RAVE) enhancement. Furthermore, it proves highly compatible with a modified all-moves-as-first (AMAF) heuristic, where it considerably outperforms RAVE. These findings suggest that the integration of TD learning into MCTS deserves further research and may form a new class of MCTS enhancements.
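The paper's three concrete adaptations are not reproduced here, but the following minimal Python sketch illustrates the general idea under our own assumptions: during backpropagation, each node's value is nudged toward a TD(λ)-style bootstrapped target rather than maintained as a plain Monte Carlo average, and UCB1 selection then ranks children by those TD value estimates. All names (Node, uct_select, td_lambda_backup) and the exact update rule are illustrative, not the authors' implementation.

```python
import math

class Node:
    """Minimal UCT tree node (illustrative; field names are our own)."""
    def __init__(self, parent=None):
        self.parent = parent
        self.children = {}   # move -> Node
        self.visits = 0
        self.value = 0.0     # TD value estimate V(s), replacing the plain MC mean

def uct_select(node, c=1.4):
    """Standard UCB1 child selection, except children are ranked by TD values."""
    log_n = math.log(node.visits + 1)
    return max(
        node.children.values(),
        key=lambda ch: ch.value + c * math.sqrt(log_n / (ch.visits + 1)),
    )

def td_lambda_backup(path, reward, alpha=0.1, lam=0.9, gamma=1.0):
    """Backpropagate a simulation outcome along the visited path with a
    TD(lambda)-style update instead of incremental Monte Carlo averaging.

    Each node's value moves toward a bootstrapped target; lambda blends the
    raw return with the successor's estimate (a lambda-return flavour).
    Sign flipping for two-player negamax bookkeeping is omitted for brevity.
    """
    target = reward                      # the leaf's target is the outcome itself
    for node in reversed(path):          # walk from the leaf back to the root
        node.visits += 1
        node.value += alpha * (target - node.value)   # TD update toward target
        # Target for the predecessor: blend bootstrap and return, decayed by gamma.
        target = gamma * ((1 - lam) * node.value + lam * target)
```

As a sanity check on the sketch, setting λ = 1, γ = 1, and replacing the fixed α with 1/visits makes the update collapse to the incremental Monte Carlo mean of standard UCT; smaller λ shifts weight toward the bootstrapped estimates of successor nodes, which is where TD learning can improve on plain sampling.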
