Monte-Carlo Planning: Theoretically Fast Convergence Meets Practical Efficiency

Popular Monte-Carlo tree search (MCTS) algorithms for online planning, such as ε-greedy tree search and UCT, aim at rapidly identifying a reasonably good action, but provide rather weak worst-case guarantees on how performance improves over time. In contrast, the recently introduced MCTS algorithm BRUE guarantees an exponential rate of improvement over time, yet it is not geared towards identifying reasonably good choices right from the start. We examine the individual strengths of these two classes of algorithms and show how they can be effectively connected. We then motivate a principle of "selective tree expansion" and suggest a concrete implementation of this principle within MCTS. The resulting algorithms compete favorably with other MCTS algorithms under short planning times, while preserving the attractive convergence properties of BRUE.
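For context on the first class of algorithms: UCT selects which action to sample at each tree node with the UCB1 rule, trading off the empirical mean reward of a child against an exploration bonus that shrinks as that child is visited more often. The sketch below is purely illustrative and is not the authors' implementation; the node/child structure and the exploration constant c are assumptions.

```python
import math

def ucb1_select(node, c=math.sqrt(2)):
    """Return the child of `node` maximizing the UCB1 index
    (empirical mean reward + exploration bonus)."""
    total_visits = sum(child.visits for child in node.children)

    def ucb1(child):
        if child.visits == 0:
            return float("inf")  # sample unvisited actions first
        exploit = child.total_reward / child.visits
        explore = c * math.sqrt(math.log(total_visits) / child.visits)
        return exploit + explore

    return max(node.children, key=ucb1)
```

This bias toward empirically promising actions is what lets UCT and similar algorithms identify a reasonably good action quickly, at the cost of the weak worst-case guarantees noted above; BRUE instead allocates samples to drive down simple regret, yielding its exponential-rate convergence.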
