Paper information - Optimistic Planning of Deterministic Systems

Optimistic Planning of Deterministic Systems

If one possesses a model of a controlled deterministic system, then from any state, one may consider the set of all possible reachable states starting from that state and using any sequence of actions. This forms a tree whose size is exponential in the planning time horizon. Here we ask the question: given finite computational resources (e.g. CPU time), which may not be known ahead of time, what is the best way to explore this tree, such that once all resources have been used, the algorithm would be able to propose an action (or a sequence of actions) whose performance is as close as possible to optimality? The performance with respect to optimality is assessed in terms of the regret (with respect to the sum of discounted future rewards) resulting from choosing the action returned by the algorithm instead of an optimal action. In this paper we investigate an optimistic exploration of the tree, where the most promising states are explored first, and compare this approach to a naive uniform exploration. Bounds on the regret are derived both for uniform and optimistic exploration strategies. Numerical simulations illustrate the benefit of optimistic planning.
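The optimistic exploration described in the abstract can be sketched as a best-first search over action sequences. This is a minimal illustration, not the authors' implementation. It assumes rewards lie in [0, 1], a discount factor `gamma` in (0, 1), and a hypothetical model interface `step(state, action) -> (next_state, reward)`. Each leaf at depth d carries the discounted reward sum u along its path and an optimistic upper bound b = u + gamma^d / (1 - gamma); the leaf with the largest b is expanded next. The budget is counted in node expansions, and the rule for the returned action (first action of the path with the largest u) is one reasonable choice that may differ from the paper's.

```python
import heapq
from itertools import count


def optimistic_plan(state, actions, step, gamma, budget):
    """Return a first action after at most `budget` node expansions.

    `step(state, action)` is a hypothetical deterministic model returning
    (next_state, reward), with reward assumed to lie in [0, 1].
    """
    tie = count()  # unique tie-breaker so the heap never compares states
    leaves = []    # heap entries: (-b, tie, u, depth, state, first_action)
    best_u, best_action = float("-inf"), None

    def expand(s, u, depth, first):
        nonlocal best_u, best_action
        for a in actions:
            s2, r = step(s, a)
            u2 = u + gamma ** depth * r                # discounted sum along the path
            d2 = depth + 1
            b2 = u2 + gamma ** d2 / (1.0 - gamma)      # optimistic bound on any continuation
            fa = a if first is None else first         # remember the root action of this path
            heapq.heappush(leaves, (-b2, next(tie), u2, d2, s2, fa))
            if u2 > best_u:
                best_u, best_action = u2, fa

    expand(state, 0.0, 0, None)  # expanding the root counts as one expansion
    expansions = 1
    while expansions < budget and leaves:
        _, _, u, d, s, fa = heapq.heappop(leaves)      # leaf with the largest b-value
        expand(s, u, d, fa)
        expansions += 1
    return best_action


# Toy usage: a chain of integer states where arriving at state 3 pays reward 1.
def chain_step(s, a):
    s2 = s + a
    return s2, (1.0 if s2 == 3 else 0.0)


print(optimistic_plan(0, (-1, 1), chain_step, gamma=0.5, budget=50))
```

A uniform strategy would instead expand leaves in breadth-first order. Under the same assumptions, replacing the heap key `-b2` with the depth `d2` gives that baseline for comparison.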

Rémi Munos | Jean-François Hren
