Static and Dynamic Values of Computation in MCTS

Monte-Carlo Tree Search (MCTS) is one of the most widely used methods for planning, and has powered many recent advances in artificial intelligence. In MCTS, one typically performs computations (i.e., simulations) to collect statistics about the possible future consequences of actions, and then chooses an action accordingly. Many popular MCTS methods, such as UCT and its variants, decide which computations to perform by trading off exploration and exploitation. In this work, we take a more direct approach and explicitly quantify the value of a computation by its expected impact on the quality of the action eventually chosen. Our approach goes beyond the "myopic" limitations of existing computation-value-based methods in two senses: (i) it accounts for the impact of non-immediate (i.e., future) computations, and (ii) it does so for non-immediate actions. We show that policies greedily optimizing computation values are optimal under certain assumptions, and we obtain empirical results competitive with the state of the art.
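
To make the notion of a computation value concrete, below is a minimal sketch, not the paper's method: it implements only the myopic, one-step value of computation (VOC) at a root node, assuming independent Gaussian posteriors over action values and Gaussian simulation noise (a criterion closely related to the knowledge gradient). The function name myopic_voc and the parameters mu, var, and obs_var are hypothetical, chosen here for illustration; the paper's contribution is precisely to go beyond this kind of myopia.

import numpy as np

def myopic_voc(mu, var, obs_var, n_samples=1000, rng=None):
    """One-step value of computation for each root action (illustrative sketch).

    Assumes independent Gaussian posteriors N(mu[i], var[i]) over action
    values and Gaussian simulation noise with variance obs_var.
    VOC(i) = E[ max_j mu'_j | one simulation of action i ] - max_j mu_j,
    i.e., the expected improvement in the best posterior mean from one
    more simulation of action i (a myopic criterion; the paper's methods
    also account for future computations and non-immediate actions).
    """
    rng = rng or np.random.default_rng()
    best_now = mu.max()
    voc = np.zeros_like(mu)
    for i in range(len(mu)):
        # Posterior-predictive draws of a simulated return for action i.
        y = rng.normal(mu[i], np.sqrt(var[i] + obs_var), size=n_samples)
        # Conjugate Gaussian update of action i's mean for each outcome.
        post_var = 1.0 / (1.0 / var[i] + 1.0 / obs_var)
        post_mu = post_var * (mu[i] / var[i] + y / obs_var)
        # Expected best-action value after the update, minus current best.
        best_other = np.max(np.delete(mu, i))
        voc[i] = np.maximum(post_mu, best_other).mean() - best_now
    return voc

# Example: simulate whichever action currently has the highest VOC.
mu = np.array([0.4, 0.5, 0.45])
var = np.array([0.2, 0.05, 0.1])
print(myopic_voc(mu, var, obs_var=0.3).argmax())

Under such a criterion, one would repeatedly simulate the action with the highest VOC and stop once the maximum VOC falls below the cost of a simulation.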
