Reinforcement Learning with a Near Optimal Rate of Convergence

We consider the problem of model-free reinforcement learning in Markov decision processes (MDPs) under the PAC ("probably approximately correct") model. We introduce a new variant of Q-learning, called speedy Q-learning (SQL), to address the slow convergence of the standard Q-learning algorithm, and we prove PAC bounds on the performance of SQL. The bounds show that for any MDP with n state-action pairs and discount factor \gamma \in [0, 1), a total of O(n \log(n/\delta)/((1 - \gamma)^4 \epsilon^2)) steps suffices for SQL to converge to an \epsilon-optimal action-value function with probability at least 1 - \delta. We also establish a lower bound of \Omega(n \log(1/\delta)/((1 - \gamma)^2 \epsilon^2)) for all reinforcement learning algorithms, which matches the upper bound in terms of \epsilon, \delta, and n (up to a logarithmic factor). Moreover, our bounds have better dependence on \epsilon and 1 - \gamma, and are therefore tighter, than the best available results for Q-learning. SQL also improves on existing results for batch Q-value iteration, which has so far been considered more efficient than incremental methods such as Q-learning.
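The abstract does not spell out the SQL update rule itself, so the following is a minimal Python sketch of a speedy-Q-learning-style update for a tabular MDP. The particular form of the update (combining single-sample Bellman backups of the current and previous Q-estimates with step size 1/(k+1)) and the names empirical_bellman and speedy_q_update are assumptions made for illustration, not a definitive restatement of the authors' algorithm.

import numpy as np

def empirical_bellman(Q, r, s_next, gamma):
    # Single-sample Bellman backup: r + gamma * max_b Q(s', b).
    return r + gamma * np.max(Q[s_next])

def speedy_q_update(Q_prev, Q_curr, s, a, r, s_next, gamma, k):
    # One SQL-style update of the (s, a) entry at iteration k (assumed form).
    alpha = 1.0 / (k + 1)                                  # decaying step size
    t_prev = empirical_bellman(Q_prev, r, s_next, gamma)   # backup of previous iterate
    t_curr = empirical_bellman(Q_curr, r, s_next, gamma)   # backup of current iterate
    # Standard Q-learning would move Q_curr toward t_curr with step alpha only;
    # the (1 - alpha)-weighted difference of successive backups is the
    # "speedy" correction assumed here.
    return Q_curr[s, a] + alpha * (t_prev - Q_curr[s, a]) + (1 - alpha) * (t_curr - t_prev)

# Toy usage with a hypothetical 3-state, 2-action MDP:
Q_prev = np.zeros((3, 2))
Q_curr = np.zeros((3, 2))
Q_curr[0, 1] = speedy_q_update(Q_prev, Q_curr, s=0, a=1, r=1.0, s_next=2, gamma=0.9, k=0)

Under these assumptions, the intuition is that the correction term lets the iterate track the empirical Bellman backup more aggressively than the small step size alone would allow, which is consistent with the improved dependence on 1 - \gamma claimed in the bound above.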
