PAC Bounds for Multi-armed Bandit and Markov Decision Processes

The multi-armed bandit problem is revisited under the PAC model. Our main contribution is to show that, given n arms, it suffices to pull the arms a total of O((n/ε²) log(1/δ)) times to find an ε-optimal arm with probability at least 1 - δ. This is in contrast to the naive bound of O((n/ε²) log(n/δ)). We derive another algorithm whose complexity depends on the specific setting of the rewards rather than on the worst case, and we also provide a matching lower bound. We then show how, given an algorithm for the PAC multi-armed bandit problem, one can derive a batch learning algorithm for Markov Decision Processes. This is done essentially by simulating Value Iteration and, in each iteration, invoking the multi-armed bandit algorithm. Using our PAC algorithm for the multi-armed bandit problem, we improve the dependence on the number of actions.
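
To make the bandit result concrete, below is a minimal sketch of the elimination idea behind the O((n/ε²) log(1/δ)) bound (the paper's Median Elimination algorithm): in each round, estimate every surviving arm's mean coarsely, discard the worse half, and tighten the accuracy and confidence parameters geometrically. The `pull` interface, the exact constants, and the Bernoulli demo at the bottom are illustrative assumptions, not the paper's exact presentation.

```python
import math
import random


def median_elimination(arms, pull, epsilon, delta):
    """Return an arm whose expected reward is within epsilon of the best
    arm's, with probability at least 1 - delta.

    `arms` is a sequence of arm identifiers and `pull(arm)` returns one
    stochastic reward in [0, 1]; both interfaces are assumptions here.
    """
    surviving = list(arms)
    eps_l, delta_l = epsilon / 4.0, delta / 2.0
    while len(surviving) > 1:
        # Sample every surviving arm often enough to estimate its mean
        # to accuracy eps_l / 2 with confidence 1 - delta_l.
        n_samples = math.ceil((4.0 / eps_l ** 2) * math.log(3.0 / delta_l))
        means = {a: sum(pull(a) for _ in range(n_samples)) / n_samples
                 for a in surviving}
        # Keep the better half of the arms (those above the median).
        surviving.sort(key=lambda a: means[a], reverse=True)
        surviving = surviving[:math.ceil(len(surviving) / 2)]
        # Tighten accuracy and confidence geometrically: the eps_l sum
        # to epsilon, the delta_l sum to delta, and because the number
        # of arms halves each round the total number of pulls stays
        # O((n / epsilon**2) * log(1 / delta)).
        eps_l *= 3.0 / 4.0
        delta_l /= 2.0
    return surviving[0]


# Hypothetical usage: four Bernoulli arms with hidden means; arm 2 is optimal.
true_means = [0.3, 0.5, 0.8, 0.6]
arm = median_elimination(range(4),
                         lambda a: float(random.random() < true_means[a]),
                         epsilon=0.1, delta=0.05)
```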

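The reduction to Markov Decision Processes can be sketched in the same spirit: simulate phased Value Iteration with a generative model and, at every state, let the PAC bandit algorithm select a near-greedy action among the arms (actions). The sketch below composes with `median_elimination` above; the `simulate` interface, the re-estimation sample size `m`, and the rescaling of backups to [0, 1] are assumptions for illustration, not the paper's construction in detail.

```python
def phased_value_iteration(states, actions, simulate, gamma, horizon,
                           epsilon, delta, best_arm=median_elimination):
    """Approximate Value Iteration driven by a PAC bandit subroutine.

    `simulate(s, a)` draws one (reward, next_state) sample from a
    generative model of the MDP; `best_arm` is any PAC bandit algorithm
    with the interface above. Backups are assumed rescaled to [0, 1].
    """
    V = {s: 0.0 for s in states}
    for _ in range(horizon):
        V_next = {}
        for s in states:
            # Treat the actions at state s as bandit arms: one pull
            # returns an unbiased sample of the backup r + gamma * V(s').
            def pull(a, s=s):
                r, s2 = simulate(s, a)
                return r + gamma * V[s2]
            a_star = best_arm(actions, pull, epsilon, delta)
            # Re-estimate the chosen action's backup for the update
            # (this sample size is illustrative, not the paper's).
            m = 100
            V_next[s] = sum(pull(a_star) for _ in range(m)) / m
        V = V_next
    return V
```

Because each iteration only has to identify an ε-optimal action rather than estimate every action's value to full accuracy, plugging in the bandit algorithm above is what improves the dependence on the number of actions.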