Monte-Carlo Planning: Theoretically Fast Convergence Meets Practical Efficiency

Popular Monte-Carlo tree search (MCTS) algorithms for online planning, such as ε-greedy tree search and UCT, aim at rapidly identifying a reasonably good action, but provide rather weak worst-case guarantees on how performance improves over time. In contrast, the recently introduced MCTS algorithm BRUE guarantees an exponential rate of improvement over time, yet it is not geared towards identifying reasonably good choices right from the start. We examine the individual strengths of these two classes of algorithms and show how they can be effectively connected. We then motivate a principle of "selective tree expansion" and suggest a concrete implementation of this principle within MCTS. The resulting algorithms compete favorably with other MCTS algorithms under short planning times, while preserving the attractive convergence properties of BRUE.
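For context on the first class of algorithms: UCT selects which action to sample at each tree node with the UCB1 rule, trading off the empirical mean reward of a child against an exploration bonus that shrinks as that child is visited more often. The sketch below is purely illustrative and is not the authors' implementation; the node/child structure and the exploration constant c are assumptions.

```python
import math

def ucb1_select(node, c=math.sqrt(2)):
    """Return the child of `node` maximizing the UCB1 index
    (empirical mean reward + exploration bonus)."""
    total_visits = sum(child.visits for child in node.children)

    def ucb1(child):
        if child.visits == 0:
            return float("inf")  # sample unvisited actions first
        exploit = child.total_reward / child.visits
        explore = c * math.sqrt(math.log(total_visits) / child.visits)
        return exploit + explore

    return max(node.children, key=ucb1)
```

This bias toward empirically promising actions is what lets UCT and similar algorithms identify a reasonably good action quickly, at the cost of the weak worst-case guarantees noted above; BRUE instead allocates samples to drive down simple regret, yielding its exponential-rate convergence.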
