Adaptive Sampling for Best Policy Identification in Markov Decision Processes

We investigate the problem of best-policy identification in discounted Markov Decision Processes (MDPs) when the learner has access to a generative model. The objective is to devise a learning algorithm that returns the best policy as early as possible. We first derive a problem-specific lower bound on the sample complexity that must be satisfied by any learning algorithm. This lower bound corresponds to an optimal sample allocation that solves a non-convex program, and hence is hard to exploit in the design of efficient algorithms. We then provide a simple and tight upper bound on this sample complexity lower bound, whose corresponding nearly-optimal sample allocation is explicit. The upper bound depends on specific functionals of the MDP, such as the sub-optimality gaps and the variance of the next-state value function, and thus genuinely captures the hardness of the MDP. Finally, we devise KLB-TS (KL Ball Track-and-Stop), an algorithm tracking this nearly-optimal allocation, and provide asymptotic guarantees for its sample complexity (both almost surely and in expectation). The advantages of KLB-TS over state-of-the-art algorithms are discussed and illustrated numerically.
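To make the tracking idea concrete, below is a minimal sketch of a D-tracking sampling rule in the spirit of Garivier and Kaufmann (2016), applied to a fixed target allocation over state-action pairs. It is not the paper's KLB-TS algorithm: the target allocation `target` is a hypothetical placeholder (in KLB-TS it would come from the explicit nearly-optimal allocation), the query to the generative model is left as a comment, and the generalized-likelihood-ratio stopping test is replaced by a fixed sampling budget.

```python
import numpy as np

def d_tracking_index(counts, omega, t):
    """One step of a D-tracking rule: force exploration of (state, action)
    pairs sampled fewer than sqrt(t) - K/2 times; otherwise pick the pair
    whose count lags furthest behind the target allocation omega."""
    K = counts.size
    starved = np.flatnonzero(counts < max(np.sqrt(t) - K / 2.0, 0.0))
    if starved.size > 0:
        return int(starved[np.argmin(counts[starved])])
    return int(np.argmax(t * omega - counts))

def track_allocation(omega, num_steps):
    """Track a fixed target allocation for a fixed budget and return the
    empirical allocation. A full algorithm would instead stop as soon as a
    generalized-likelihood-ratio statistic exceeds a threshold beta(t, delta);
    that stopping test is problem-specific and omitted here."""
    counts = np.zeros_like(omega, dtype=float)
    for t in range(1, num_steps + 1):
        idx = d_tracking_index(counts, omega, t)
        # A real learner would now query the generative model at pair `idx`
        # and update its empirical estimate of the MDP.
        counts[idx] += 1.0
    return counts / counts.sum()

if __name__ == "__main__":
    # Hypothetical target allocation over 4 state-action pairs.
    target = np.array([0.4, 0.3, 0.2, 0.1])
    print(track_allocation(target, num_steps=5000))  # empirical allocation ~ target
```

As the budget grows, the empirical allocation converges to the target; controlling this tracking error is the standard ingredient behind the almost-sure and expected sample complexity guarantees of track-and-stop style algorithms.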