Q-learning with Uniformly Bounded Variance: Large Discounting is Not a Barrier to Fast Learning

Sample complexity bounds are a common performance metric in the Reinforcement Learning literature. In the discounted-cost, infinite-horizon setting, all of the known bounds contain a factor that is polynomial in $1/(1-\gamma)$, where $\gamma < 1$ is the discount factor. This paper introduces a class of algorithms whose variance is uniformly bounded over all $\gamma < 1$, with bounds depending instead on the spectral gap of an optimal transition matrix.
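For reference, a minimal sketch of the standard tabular (Watkins) Q-learning recursion that such sample-complexity results concern is given below. The `env_step` interface, the uniform exploration policy, and the $1/n$ step-size schedule are illustrative assumptions, not the specific algorithm or step-size rule analysed in this paper.

```python
import numpy as np

def q_learning(env_step, n_states, n_actions, gamma=0.99, n_steps=100_000, seed=0):
    """Tabular Watkins Q-learning sketch.

    env_step(s, a, rng) -> (reward, next_state) is an assumed environment interface.
    """
    rng = np.random.default_rng(seed)
    Q = np.zeros((n_states, n_actions))
    visits = np.zeros((n_states, n_actions))   # per state-action visit counts
    s = 0
    for _ in range(n_steps):
        a = int(rng.integers(n_actions))       # uniform exploration policy (assumption)
        r, s_next = env_step(s, a, rng)
        visits[s, a] += 1
        alpha = 1.0 / visits[s, a]             # diminishing 1/n step size (assumption)
        # Temporal-difference target uses the discount factor gamma; the variance of
        # this recursion is what grows with 1/(1 - gamma) in the known bounds.
        td_target = r + gamma * Q[s_next].max()
        Q[s, a] += alpha * (td_target - Q[s, a])
        s = s_next
    return Q
```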
