Action Elimination and Stopping Conditions for the Multi-Armed Bandit and Reinforcement Learning Problems

We incorporate statistical confidence intervals in both the multi-armed bandit and the reinforcement learning problems. In the bandit problem we show that given n arms, it suffices to pull the arms a total of O((n/ε²)log(1/δ)) times to find an ε-optimal arm with probability at least 1-δ. This bound matches the lower bound of Mannor and Tsitsiklis (2004) up to constants. We also devise action elimination procedures in reinforcement learning algorithms. We describe a framework that is based on learning the confidence interval around the value function or the Q-function and eliminating actions that are not optimal (with high probability). We provide model-based and model-free variants of the elimination method. We further derive stopping conditions guaranteeing that the learned policy is approximately optimal with high probability. Simulations demonstrate a considerable speedup and added robustness over ε-greedy Q-learning.
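The bandit claim can be illustrated with a successive-elimination sketch: pull every surviving arm each round, maintain Hoeffding confidence intervals, and drop any arm whose upper bound falls below the lower bound of the empirically best arm. The sampling schedule, constants, and stopping rule below are illustrative assumptions (in Python), not the paper's exact procedure:

```python
import math
import random

def successive_elimination(arms, epsilon, delta):
    """Return (arm index, total pulls). Under Hoeffding bounds and rewards
    in [0, 1], the returned arm is epsilon-optimal with probability >= 1 - delta."""
    n = len(arms)
    active = list(range(n))
    means = [0.0] * n
    pulls = 0
    t = 0  # pulls per surviving arm so far
    while len(active) > 1:
        t += 1
        # Pull every surviving arm once and update its running mean.
        for i in active:
            means[i] += (arms[i]() - means[i]) / t
        pulls += len(active)
        # Hoeffding radius; the union bound over arms and rounds is folded
        # into the log term (an illustrative choice of constants).
        radius = math.sqrt(math.log(4 * n * t * t / delta) / (2 * t))
        best = max(means[i] for i in active)
        # Eliminate arms whose upper confidence bound is below the
        # lower confidence bound of the empirically best arm.
        active = [i for i in active if means[i] + radius >= best - radius]
        # Once intervals are tight enough, any surviving arm is epsilon-optimal.
        if 4 * radius <= epsilon:
            break
    return max(active, key=lambda i: means[i]), pulls

# Toy usage: Bernoulli arms with means 0.3, 0.5, 0.45, 0.9.
arms = [lambda p=p: float(random.random() < p) for p in (0.3, 0.5, 0.45, 0.9)]
best_arm, total = successive_elimination(arms, epsilon=0.1, delta=0.05)
```

Since clearly suboptimal arms leave the active set early, the total pull count concentrates on the near-optimal arms, which is the mechanism behind the O((n/ε²)log(1/δ)) bound.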
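For the reinforcement learning side, a minimal model-free sketch of the elimination idea keeps optimistic and pessimistic Q estimates and restricts exploration to actions whose upper estimate still reaches the best lower estimate in the current state. The environment interface (reset/step), the learning rate, and the confidence radius below are assumptions for illustration, heuristic stand-ins for the paper's exact bounds:

```python
import math
import random
from collections import defaultdict

def q_learning_with_elimination(env, n_states, n_actions,
                                episodes=1000, gamma=0.95, delta=0.05):
    """Sketch of model-free action elimination. Assumes rewards in [0, 1] and
    an env exposing reset() -> state and step(a) -> (state, reward, done)."""
    v_max = 1.0 / (1.0 - gamma)
    q_up = [[v_max] * n_actions for _ in range(n_states)]  # optimistic Q
    q_lo = [[0.0] * n_actions for _ in range(n_states)]    # pessimistic Q
    counts = defaultdict(int)

    def surviving(s):
        # Actions not yet eliminated in state s: their upper estimate
        # still reaches the best lower estimate.
        best_lo = max(q_lo[s])
        return [a for a in range(n_actions) if q_up[s][a] >= best_lo]

    for _ in range(episodes):
        s, done = env.reset(), False
        while not done:
            a = random.choice(surviving(s))  # explore only surviving actions
            s2, r, done = env.step(a)
            counts[(s, a)] += 1
            alpha = 1.0 / counts[(s, a)]     # simple decaying learning rate
            # Heuristic confidence radius standing in for the paper's bounds.
            radius = math.sqrt(math.log(2.0 * n_states * n_actions / delta)
                               / counts[(s, a)])
            up_target = r + gamma * max(q_up[s2]) + radius
            lo_target = r + gamma * max(q_lo[s2]) - radius
            q_up[s][a] = min(v_max, q_up[s][a] + alpha * (up_target - q_up[s][a]))
            q_lo[s][a] = max(0.0, q_lo[s][a] + alpha * (lo_target - q_lo[s][a]))
            s = s2
    return q_lo, q_up
```

A stopping condition in this spirit declares the greedy policy approximately optimal once max(q_up[s]) - max(q_lo[s]) falls below ε in the states of interest, mirroring the high-probability guarantee described above.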

[1] W. Hoeffding. Probability Inequalities for Sums of Bounded Random Variables, 1963.

[2] J. MacQueen. A Modified Dynamic Programming Method for Markovian Decision Problems, 1966.

[3] Leslie G. Valiant, et al. Fast probabilistic algorithms for Hamiltonian circuits and matchings, 1977, STOC '77.

[4] Martin L. Puterman, et al. Markov Decision Processes: Discrete Stochastic Dynamic Programming, 1994.

[5] Nicolò Cesa-Bianchi, et al. Gambling in a rigged casino: The adversarial multi-armed bandit problem, 1995, Proceedings of the 36th Annual IEEE Symposium on Foundations of Computer Science.

[6] Ben J. A. Kröse, et al. Learning from delayed rewards, 1995, Robotics Auton. Syst.

[7] Anders R. Kristensen, et al. Dynamic programming and Markov decision processes, 1996.

[8] John N. Tsitsiklis, et al. Neuro-Dynamic Programming, 1996, Encyclopedia of Machine Learning.

[9] Andrew G. Barto, et al. Reinforcement learning, 1998.

[10] Michael Kearns, et al. Finite-Sample Convergence Rates for Q-Learning and Indirect Algorithms, 1998, NIPS.

[11] Sean P. Meyn, et al. The O.D.E. Method for Convergence of Stochastic Approximation and Reinforcement Learning, 2000, SIAM J. Control Optim.

[12] Boaz Patt-Shamir, et al. Buffer overflow management in QoS switches, 2001, STOC '01.

[13] Y. Freund, et al. The non-stochastic multi-armed bandit problem, 2001.

[14] Yishay Mansour, et al. Learning Rates for Q-learning, 2004, J. Mach. Learn. Res.

[15] Peter Auer, et al. The Nonstochastic Multiarmed Bandit Problem, 2002, SIAM J. Comput.

[16] John Langford, et al. Approximately Optimal Approximate Reinforcement Learning, 2002, ICML.

[17] John N. Tsitsiklis, et al. The Sample Complexity of Exploration in the Multi-Armed Bandit Problem, 2004, J. Mach. Learn. Res.

[18] Satinder Singh, et al. An upper bound on the loss from approximate optimal-value functions, 1994, Machine Learning.

[19] Michael Kearns, et al. Near-Optimal Reinforcement Learning in Polynomial Time, 1998, Machine Learning.

[20] Yishay Mansour, et al. Competitive queue policies for differentiated services, 2000, Proceedings of IEEE INFOCOM 2000.

[21] Csaba Szepesvári, et al. Finite time bounds for sampling-based fitted value iteration, 2005, ICML.

[22] Liming Xiang, et al. Kernel-Based Reinforcement Learning, 2006, ICIC.

[23] H. Robbins. Some aspects of the sequential design of experiments, 1952.

[24] T. L. Lai and Herbert Robbins. Asymptotically Efficient Adaptive Allocation Rules, 1985, Adv. Appl. Math.