An improved approach to reinforcement learning in Computer Go

Monte-Carlo Tree Search (MCTS) has revolutionized Computer Go, with programs based on the algorithm achieving a level of play that previously seemed decades away. However, since the technique involves constructing a search tree, its performance tends to degrade in larger state spaces. Dyna-2 is a hybrid approach that attempts to overcome this shortcoming by combining Monte-Carlo methods with state abstraction. While not competitive with the strongest MCTS-based programs, the Dyna-2-based program RLGO achieved the highest ever rating by a traditional program on the 9×9 Computer Go Server. Plain Dyna-2 uses ε-greedy exploration and a flat learning rate, but we show that the performance of the algorithm can be significantly improved by making some relatively minor adjustments to this configuration. Our strongest modified program achieved an Elo rating 289 points higher than the original in head-to-head play, equivalent to an expected win rate of 84%.
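To make the baseline configuration concrete, the following is a minimal, generic sketch of ε-greedy action selection with a constant ("flat") learning rate, in the tabular style of standard reinforcement-learning texts. The dictionary-based value table and the names `q_values`, `epsilon`, and `alpha` are illustrative assumptions, not RLGO's actual feature-based value function.

```python
import random

def epsilon_greedy(q_values: dict[str, float], epsilon: float) -> str:
    """With probability epsilon play a uniformly random move; otherwise play greedily."""
    if random.random() < epsilon:
        return random.choice(list(q_values))
    return max(q_values, key=q_values.get)

def td_update(q_values: dict[str, float], move: str, target: float, alpha: float) -> None:
    """Constant step-size update: a flat learning rate alpha that never decays."""
    q_values[move] += alpha * (target - q_values[move])
```

The reported 289-point gain can be checked against the standard Elo expected-score formula, E = 1 / (1 + 10^(-D/400)), where D is the rating difference:

```python
def expected_win_rate(elo_diff: float) -> float:
    """Expected score of the stronger player under the standard Elo model."""
    return 1.0 / (1.0 + 10.0 ** (-elo_diff / 400.0))

print(f"{expected_win_rate(289):.1%}")  # ~84.1%, matching the quoted 84%
```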
