Bayesian Inference in Monte-Carlo Tree Search

Monte-Carlo Tree Search (MCTS) methods are drawing great interest after yielding breakthrough results in computer Go. This paper proposes a Bayesian approach to MCTS that is inspired by distribution-free approaches such as UCT [13], yet differs significantly in important respects. The Bayesian framework allows potentially much more accurate (Bayes-optimal) estimation of node values and node uncertainties from a limited number of simulation trials. We further propose propagating inference in the tree via fast analytic Gaussian approximation methods: this can make the overhead of Bayesian inference manageable in domains such as Go, while preserving high accuracy of expected-value estimates. We find that the proposed methods substantially outperform UCT empirically in an idealized bandit-tree test environment, where we can obtain valuable insights by comparing with known ground truth. Additionally, we rigorously prove on-policy and off-policy convergence of the proposed methods.
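The abstract does not spell out the Gaussian approximation itself. As a rough illustration of the kind of analytic step involved, the sketch below uses the moment-matching formulas of Clark [1] for the maximum of two Gaussian random variables, and folds them pairwise over a node's children. The function names (`clark_max`, `max_of_children`), the (mean, variance) representation, and the independence assumption between children are choices made for this example, not details taken from the paper.

```python
import math


def _pdf(x):
    """Standard normal density."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)


def _cdf(x):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def clark_max(mu1, var1, mu2, var2, rho=0.0):
    """Mean and variance of max(X1, X2) for jointly Gaussian X1, X2.

    Uses Clark's first- and second-moment formulas; rho is the
    correlation coefficient (0.0 assumes independence).
    """
    a2 = var1 + var2 - 2.0 * rho * math.sqrt(var1 * var2)
    if a2 <= 0.0:
        # X1 - X2 is deterministic: the max is whichever has the larger mean.
        return (mu1, var1) if mu1 >= mu2 else (mu2, var2)
    a = math.sqrt(a2)
    b = (mu1 - mu2) / a
    p_first, p_second, dens = _cdf(b), _cdf(-b), _pdf(b)
    m1 = mu1 * p_first + mu2 * p_second + a * dens
    m2 = ((mu1 * mu1 + var1) * p_first
          + (mu2 * mu2 + var2) * p_second
          + (mu1 + mu2) * a * dens)
    return m1, max(m2 - m1 * m1, 0.0)


def max_of_children(children):
    """Approximate the max over a list of (mean, variance) pairs.

    Each pairwise max is re-approximated as a Gaussian before the next
    child is folded in, so the result is an approximation for 3+ children.
    """
    mean, var = children[0]
    for mu, v in children[1:]:
        mean, var = clark_max(mean, var, mu, v)
    return mean, var


if __name__ == "__main__":
    # Hypothetical child-value estimates as (mean, variance) pairs.
    print(max_of_children([(0.50, 0.04), (0.55, 0.09), (0.30, 0.01)]))
```

For two children the moments are exact for the true maximum; with more children the pairwise folding treats each intermediate maximum as Gaussian, which is the source of the approximation error.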

[1] C. E. Clark. The Greatest of a Finite Set of Random Variables. 1961.

[2] Leslie Pack Kaelbling, et al. Learning in embedded systems. 1993.

[3] Eric B. Baum, et al. A Bayesian Approach to Relevance in Game Playing. Artif. Intell., 1997.

[4] Bruno Bouzy, et al. Associating Shallow and Selective Global Tree Search with Monte Carlo for 9×9 Go. Computers and Games, 2004.

[5] Peter Auer, et al. Finite-time Analysis of the Multiarmed Bandit Problem. Machine Learning, 2002.

[6] Jos W. H. M. Uiterwijk, et al. Monte-Carlo tree search in production management problems. 2006.

[7] P. Maes. How to Do the Right Thing. 1989.

[8] Olivier Teytaud, et al. Modification of UCT with Patterns in Monte-Carlo Go. 2006.

[9] Rémi Coulom, et al. Efficient Selectivity and Backup Operators in Monte-Carlo Tree Search. Computers and Games, 2006.

[10] Csaba Szepesvári, et al. Bandit Based Monte-Carlo Planning. ECML, 2006.

[11] Csaba Szepesvári, et al. Tuning Bandit Algorithms in Stochastic Environments. ALT, 2007.

[12] Rémi Munos, et al. Bandit Algorithms for Tree Search. UAI, 2007.

[13] Gabriel Kronberger, et al. Bandit-Based Monte-Carlo Planning for the Single-Machine Total Weighted Tardiness Scheduling Problem. EUROCAST, 2007.

[14] Hai Zhou, et al. Advances in Computation of the Maximum of a Set of Gaussian Random Variables. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 2007.

[15] H. Jaap van den Herik, et al. Single-Player Monte-Carlo Tree Search. Computers and Games, 2008.

[16] Yngvi Björnsson, et al. Simulation-Based Approach to General Game Playing. AAAI, 2008.

[17] Csaba Szepesvári, et al. Online Optimization in X-Armed Bandits. NIPS, 2008.