Open Loop Optimistic Planning

We consider the problem of planning in a stochastic and discounted environment with a limited numerical budget. More precisely, we investigate strategies that explore the set of possible sequences of actions so that, once all available numerical resources (e.g., CPU time, number of calls to a generative model) have been used, one returns a recommendation for the best immediate action to take, based on this exploration. The performance of a strategy is assessed in terms of its simple regret, that is, the loss in performance resulting from choosing the recommended action instead of an optimal one. We first provide a minimax lower bound for this problem, and show that a uniform planning strategy matches this minimax rate (up to a logarithmic factor). Then we propose a UCB (Upper Confidence Bounds)-based planning algorithm, called OLOP (Open-Loop Optimistic Planning), which is also minimax optimal, and prove that it enjoys much faster rates when there is a small proportion of near-optimal sequences of actions. Finally, we compare our results with the regret bounds one can derive in our setting using bandit algorithms designed for an infinite number of arms.
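The following is a minimal, hedged sketch of the open-loop optimistic planning idea described above: maintain counts and empirical mean rewards for each action prefix, assign each full-length action sequence an optimistic (UCB-style) estimate of its discounted value, repeatedly play the most promising sequence through a generative model, and finally recommend a first action. The environment stub `sample_reward`, the specific confidence bonus, and the budget parameters are illustrative assumptions, not the paper's exact OLOP specification or bounds.

```python
# Illustrative sketch of open-loop optimistic planning (OLOP-style), not the
# authors' exact algorithm. The generative model, confidence term, and budget
# choices below are assumptions made for the example.

import math
import random
from collections import defaultdict
from itertools import product

GAMMA = 0.9      # discount factor
K = 2            # number of actions
DEPTH = 4        # length L of the explored action sequences
EPISODES = 200   # number of planning episodes (each uses DEPTH generative-model calls)


def sample_reward(prefix):
    """Toy stochastic generative model: the mean reward depends on the prefix.

    Stands in for the real environment; rewards are clipped to [0, 1].
    """
    mean = 0.8 if all(a == 0 for a in prefix) else 0.4
    return min(1.0, max(0.0, random.gauss(mean, 0.1)))


def plan():
    counts = defaultdict(int)    # T(a_1:h): number of times a prefix was played
    sums = defaultdict(float)    # cumulative reward observed right after a prefix

    def optimistic_value(seq, n):
        """Upper confidence bound on the discounted value of an action sequence."""
        total = 0.0
        for h in range(1, len(seq) + 1):
            prefix = seq[:h]
            t = counts[prefix]
            if t == 0:
                # Unvisited prefix: bound all rewards from depth h onward by 1.
                return total + GAMMA ** h / (1.0 - GAMMA)
            mu = sums[prefix] / t
            bonus = math.sqrt(2.0 * math.log(n) / t)
            total += GAMMA ** h * min(1.0, mu + bonus)
        # Optimistic bound on the rewards beyond the explored depth.
        return total + GAMMA ** (len(seq) + 1) / (1.0 - GAMMA)

    all_seqs = list(product(range(K), repeat=DEPTH))
    for n in range(1, EPISODES + 1):
        # Episode: play, open loop, the sequence with the highest optimistic value.
        seq = max(all_seqs, key=lambda s: optimistic_value(s, n))
        for h in range(1, DEPTH + 1):
            prefix = seq[:h]
            counts[prefix] += 1
            sums[prefix] += sample_reward(prefix)

    # Recommend the most frequently played first action.
    return max(range(K), key=lambda a: counts[(a,)])


if __name__ == "__main__":
    print("recommended first action:", plan())
```

With the toy model above, sequences starting with action 0 have higher mean reward, so the sketch should recommend action 0 with high probability; in practice the enumeration of all sequences would be replaced by a search over the prefix tree when K and L are larger.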
