Q-learning with Uniformly Bounded Variance: Large Discounting is Not a Barrier to Fast Learning

Sample complexity bounds are a common performance metric in the Reinforcement Learning literature. In the discounted-cost, infinite-horizon setting, all of the known bounds contain a factor that is polynomial in $1/(1-\gamma)$, where $\gamma < 1$ is the discount factor. This paper introduces a class of algorithms whose variance is uniformly bounded over all $\gamma < 1$, with bounds depending instead on the spectral gap of an optimal transition matrix.
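For reference, a minimal sketch of the standard tabular (Watkins) Q-learning recursion that such sample-complexity results concern is given below. The `env_step` interface, the uniform exploration policy, and the $1/n$ step-size schedule are illustrative assumptions, not the specific algorithm or step-size rule analysed in this paper.

```python
import numpy as np

def q_learning(env_step, n_states, n_actions, gamma=0.99, n_steps=100_000, seed=0):
    """Tabular Watkins Q-learning sketch.

    env_step(s, a, rng) -> (reward, next_state) is an assumed environment interface.
    """
    rng = np.random.default_rng(seed)
    Q = np.zeros((n_states, n_actions))
    visits = np.zeros((n_states, n_actions))   # per state-action visit counts
    s = 0
    for _ in range(n_steps):
        a = int(rng.integers(n_actions))       # uniform exploration policy (assumption)
        r, s_next = env_step(s, a, rng)
        visits[s, a] += 1
        alpha = 1.0 / visits[s, a]             # diminishing 1/n step size (assumption)
        # Temporal-difference target uses the discount factor gamma; the variance of
        # this recursion is what grows with 1/(1 - gamma) in the known bounds.
        td_target = r + gamma * Q[s_next].max()
        Q[s, a] += alpha * (td_target - Q[s, a])
        s = s_next
    return Q
```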
