Open Loop Optimistic Planning

We consider the problem of planning in a stochastic and discounted environment with a limited numerical budget. More precisely, we investigate strategies that explore the set of possible sequences of actions so that, once all available numerical resources (e.g., CPU time, number of calls to a generative model) have been used, one returns a recommendation for the best immediate action to take, based on this exploration. The performance of a strategy is assessed in terms of its simple regret, that is, the loss in performance resulting from choosing the recommended action instead of an optimal one. We first provide a minimax lower bound for this problem, and show that a uniform planning strategy matches this minimax rate (up to a logarithmic factor). Then we propose a UCB (Upper Confidence Bounds)-based planning algorithm, called OLOP (Open-Loop Optimistic Planning), which is also minimax optimal, and prove that it enjoys much faster rates when there is a small proportion of near-optimal sequences of actions. Finally, we compare our results with the regret bounds one can derive in our setting using bandit algorithms designed for an infinite number of arms.
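The following is a minimal, hedged sketch of the open-loop optimistic planning idea described above: maintain counts and empirical mean rewards for each action prefix, assign each full-length action sequence an optimistic (UCB-style) estimate of its discounted value, repeatedly play the most promising sequence through a generative model, and finally recommend a first action. The environment stub `sample_reward`, the specific confidence bonus, and the budget parameters are illustrative assumptions, not the paper's exact OLOP specification or bounds.

```python
# Illustrative sketch of open-loop optimistic planning (OLOP-style), not the
# authors' exact algorithm. The generative model, confidence term, and budget
# choices below are assumptions made for the example.

import math
import random
from collections import defaultdict
from itertools import product

GAMMA = 0.9      # discount factor
K = 2            # number of actions
DEPTH = 4        # length L of the explored action sequences
EPISODES = 200   # number of planning episodes (each uses DEPTH generative-model calls)


def sample_reward(prefix):
    """Toy stochastic generative model: the mean reward depends on the prefix.

    Stands in for the real environment; rewards are clipped to [0, 1].
    """
    mean = 0.8 if all(a == 0 for a in prefix) else 0.4
    return min(1.0, max(0.0, random.gauss(mean, 0.1)))


def plan():
    counts = defaultdict(int)    # T(a_1:h): number of times a prefix was played
    sums = defaultdict(float)    # cumulative reward observed right after a prefix

    def optimistic_value(seq, n):
        """Upper confidence bound on the discounted value of an action sequence."""
        total = 0.0
        for h in range(1, len(seq) + 1):
            prefix = seq[:h]
            t = counts[prefix]
            if t == 0:
                # Unvisited prefix: bound all rewards from depth h onward by 1.
                return total + GAMMA ** h / (1.0 - GAMMA)
            mu = sums[prefix] / t
            bonus = math.sqrt(2.0 * math.log(n) / t)
            total += GAMMA ** h * min(1.0, mu + bonus)
        # Optimistic bound on the rewards beyond the explored depth.
        return total + GAMMA ** (len(seq) + 1) / (1.0 - GAMMA)

    all_seqs = list(product(range(K), repeat=DEPTH))
    for n in range(1, EPISODES + 1):
        # Episode: play, open loop, the sequence with the highest optimistic value.
        seq = max(all_seqs, key=lambda s: optimistic_value(s, n))
        for h in range(1, DEPTH + 1):
            prefix = seq[:h]
            counts[prefix] += 1
            sums[prefix] += sample_reward(prefix)

    # Recommend the most frequently played first action.
    return max(range(K), key=lambda a: counts[(a,)])


if __name__ == "__main__":
    print("recommended first action:", plan())
```

With the toy model above, sequences starting with action 0 have higher mean reward, so the sketch should recommend action 0 with high probability; in practice the enumeration of all sequences would be replaced by a search over the prefix tree when K and L are larger.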
