Stochastic approximation with cone-contractive operators: Sharp $\ell_\infty$-bounds for $Q$-learning

Motivated by $Q$-learning algorithms in reinforcement learning, we study a class of stochastic approximation procedures based on operators that satisfy monotonicity and quasi-contractivity conditions with respect to an underlying cone. We prove a general sandwich relation on the iterate error at each time step, and use it to derive non-asymptotic bounds on the error in terms of a cone-induced gauge norm. These results are obtained within a purely deterministic framework, requiring no assumptions on the noise. We illustrate the general bounds by applying them to synchronous $Q$-learning for discounted Markov decision processes with discrete state-action spaces, in particular deriving non-asymptotic bounds on the $\ell_\infty$-norm of the error for a range of stepsizes. These bounds are the sharpest known to date, and we show via simulation that their dependence on the discount factor cannot be improved in a worst-case sense. Consequently, relative to model-based $Q$-iteration, the $\ell_\infty$-based sample complexity of $Q$-learning is suboptimal in terms of the discount factor $\gamma$.
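
For concreteness, the synchronous $Q$-learning scheme referred to here is the standard update with a generative model (the notation below is ours, included only for illustration): at each iteration $t$, for every state-action pair $(s,a)$, a next state $s_t'(s,a) \sim P(\cdot \mid s,a)$ is drawn and the estimate is updated as
$$
Q_{t+1}(s,a) \;=\; (1-\lambda_t)\, Q_t(s,a) \;+\; \lambda_t \Big( r(s,a) + \gamma \max_{a'} Q_t\big(s_t'(s,a),\, a'\big) \Big),
$$
where $\lambda_t \in (0,1]$ is the stepsize and $\gamma \in (0,1)$ is the discount factor. The deterministic analogue, obtained by replacing the sampled next state with an expectation over $P(\cdot \mid s,a)$, is the model-based $Q$-iteration against which the sample complexity is compared.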
