Cross-Entropy for Monte-Carlo Tree Search

Monte-Carlo Tree Search (MCTS) has recently become a popular approach to intelligent play in games, and it is used successfully in most state-of-the-art Go programs. Improving the playing strength of these programs further requires fine-tuning the many parameters that govern MCTS. In this paper, we propose to apply the Cross-Entropy Method (CEM) to this task. CEM is comparable to Estimation-of-Distribution Algorithms (EDAs), a recent branch of evolutionary computation. We tested CEM by tuning various types of parameters in our Go program MANGO, playing matches against the open-source program GNU GO. The experiments showed that the program with CEM-tuned parameters played better than the untuned version, and that MANGO with CEM outperformed the regular MANGO across various time settings and board sizes. We conclude that parameter tuning by CEM genuinely improved the playing strength of MANGO, and the result may generalize to other game engines based on MCTS.
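
The abstract does not spell out the update rule, but a minimal sketch of the Gaussian cross-entropy loop it alludes to might look as follows. The `evaluate` callback (a stand-in for the win rate over a batch of games against a fixed opponent), the sample counts, and the elite fraction are illustrative assumptions, not values taken from the paper.

```python
import random
import statistics

def cross_entropy_search(evaluate, means, stdevs,
                         iterations=20, samples=30, elite_frac=0.2):
    """Generic CEM loop: sample parameter vectors from independent
    Gaussians, keep the elite fraction by fitness, and refit the
    sampling distribution to those elite samples."""
    n_elite = max(1, int(samples * elite_frac))
    for _ in range(iterations):
        # Draw candidate parameter vectors from the current distribution.
        population = [
            [random.gauss(m, s) for m, s in zip(means, stdevs)]
            for _ in range(samples)
        ]
        # Rank candidates by fitness; here evaluate() is assumed to return
        # something like the engine's win rate with those parameters.
        scored = sorted(population, key=evaluate, reverse=True)
        elite = scored[:n_elite]
        # Refit mean and standard deviation of each parameter to the elite.
        means = [statistics.mean(col) for col in zip(*elite)]
        stdevs = [statistics.pstdev(col) for col in zip(*elite)]
    return means
```

As the standard deviations shrink over iterations, the search concentrates around the best-performing parameter settings, which is the behavior the paper exploits for tuning MCTS parameters.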
