Paper Information - Bandit Based Monte-Carlo Planning

Bandit Based Monte-Carlo Planning

For large state-space Markovian Decision Problems, Monte-Carlo planning is one of the few viable approaches to finding near-optimal solutions. In this paper we introduce a new algorithm, UCT, that applies bandit ideas to guide Monte-Carlo planning. In finite-horizon or discounted MDPs, the algorithm is shown to be consistent, and finite-sample bounds are derived on the estimation error due to sampling. Experimental results show that in several domains UCT is significantly more efficient than its alternatives.

Csaba Szepesvári | Levente Kocsis

[1] Dana S. Nau, et al. An Analysis of Forward Pruning, 1994, AAAI.

[2] Gerald Tesauro, et al. On-line Policy Improvement using Monte-Carlo Search, 1996, NIPS.

[3] Peter Auer, et al. The Nonstochastic Multiarmed Bandit Problem, 2002, SIAM J. Comput.

[4] Jonathan Schaeffer, et al. The challenge of poker, 2002, Artif. Intell.

[5] Brian Sheppard, et al. World-championship-caliber Scrabble, 2002, Artif. Intell.

[6] Bruno Bouzy, et al. Monte-Carlo Go Developments, 2003, ACG.

[7] Peter Auer, et al. Finite-time Analysis of the Multiarmed Bandit Problem, 2002, Machine Learning.

[8] Frédérick Garcia, et al. On-Line Search for Solving Markov Decision Processes via Heuristic Sampling, 2004, ECAI.

[9] Yishay Mansour, et al. A Sparse Sampling Algorithm for Near-Optimal Planning in Large Markov Decision Processes, 1999, Machine Learning.

[10] Michael C. Fu, et al. An Adaptive Sampling Algorithm for Solving Markov Decision Processes, 2005, Oper. Res.

[11] Jonathan Schaeffer, et al. Monte Carlo Planning in RTS Games, 2005, CIG.

[12] T. L. Lai and Herbert Robbins. Asymptotically Efficient Adaptive Allocation Rules, 1985.