Aggregating Optimistic Planning Trees for Solving Markov Decision Processes

This paper addresses the problem of online planning in Markov decision processes using a randomized simulator, under a budget constraint. We propose a new algorithm based on constructing a forest of planning trees, where each tree corresponds to a random realization of the stochastic environment. The trees are built with a "safe" optimistic planning strategy that combines the optimistic principle (to explore the most promising part of the search space first) with a safety principle (which guarantees a certain amount of uniform exploration). In the decision-making step, the individual trees are aggregated and an immediate action is recommended. We provide a finite-sample analysis and discuss the trade-off between the principles of optimism and safety. We also report numerical results on a benchmark problem, where our algorithm performs as well as state-of-the-art optimistic planning algorithms and better than a related algorithm that additionally assumes knowledge of all transition distributions.
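
To make the construction concrete, below is a minimal Python sketch of the forest-building and aggregation steps described above. It is an illustration under stated assumptions rather than the paper's exact algorithm: rewards are assumed to lie in [0, 1], the safety principle is modeled as epsilon-uniform leaf selection, aggregation is a plain average of per-tree root-action values, and all names (simulate, BUDGET, NUM_TREES, EPSILON) are hypothetical.

```python
import random

GAMMA = 0.9        # discount factor; rewards assumed to lie in [0, 1]
NUM_ACTIONS = 2    # illustrative action-space size
NUM_TREES = 10     # number of trees in the forest
BUDGET = 200       # node expansions allowed per tree
EPSILON = 0.1      # probability of uniform ("safe") exploration


class Node:
    def __init__(self, state, depth, reward=0.0, parent=None):
        self.state, self.depth = state, depth
        self.reward, self.parent = reward, parent   # reward received on entering this node
        self.children = []                          # one child per action once expanded

    def b_value(self):
        """Optimistic upper bound on the return of any path through this leaf:
        discounted rewards collected so far plus gamma^d / (1 - gamma) for the
        unknown tail (valid when rewards are in [0, 1])."""
        u, node = 0.0, self
        while node.parent is not None:
            u += GAMMA ** (node.depth - 1) * node.reward
            node = node.parent
        return u + GAMMA ** self.depth / (1.0 - GAMMA)


def build_tree(root_state, simulate):
    """Grow one planning tree on one random realization of the simulator."""
    root = Node(root_state, depth=0)
    leaves = [root]
    for _ in range(BUDGET):
        # Safe-optimistic leaf selection: mostly expand the leaf with the
        # highest optimistic bound, occasionally a uniformly random leaf.
        if random.random() < EPSILON:
            leaf = random.choice(leaves)
        else:
            leaf = max(leaves, key=lambda l: l.b_value())
        leaves.remove(leaf)
        for action in range(NUM_ACTIONS):
            next_state, reward = simulate(leaf.state, action)
            child = Node(next_state, leaf.depth + 1, reward, parent=leaf)
            leaf.children.append(child)
            leaves.append(child)
    return root


def root_action_values(root):
    """Lower-bound value of each root action via a max backup of sampled rewards
    (unexpanded leaves contribute 0, a valid lower bound for rewards in [0, 1])."""
    def value(node):
        if not node.children:
            return 0.0
        return max(c.reward + GAMMA * value(c) for c in node.children)
    return [c.reward + GAMMA * value(c) for c in root.children]


def recommend(root_state, simulate):
    """Aggregate per-action value estimates over the forest and recommend
    the action with the highest total."""
    totals = [0.0] * NUM_ACTIONS
    for _ in range(NUM_TREES):
        for a, v in enumerate(root_action_values(build_tree(root_state, simulate))):
            totals[a] += v
    return max(range(NUM_ACTIONS), key=lambda a: totals[a])


if __name__ == "__main__":
    def simulate(state, action):
        # Toy randomized simulator: action 1 pays more on average.
        return state + 1, random.random() * (0.4 if action == 0 else 0.6)

    print("recommended action:", recommend(0, simulate))
```

Each call to build_tree draws fresh transitions from the simulator, so every tree encodes one random realization of the environment; varying EPSILON in this sketch exposes the optimism/safety trade-off analyzed in the paper.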
