Sample Complexity of Asynchronous Q-Learning: Sharper Analysis and Variance Reduction

Asynchronous Q-learning aims to learn the optimal action-value function (or Q-function) of a Markov decision process (MDP), based on a single trajectory of Markovian samples induced by a behavior policy. Focusing on a $\gamma$-discounted MDP with state space $\mathcal{S}$ and action space $\mathcal{A}$, we demonstrate that the $\ell_{\infty}$-based sample complexity of classical asynchronous Q-learning (namely, the number of samples needed to yield an entrywise $\varepsilon$-accurate estimate of the Q-function) is at most on the order of $\frac{1}{\mu_{\mathsf{min}}(1-\gamma)^{5}\varepsilon^{2}} + \frac{t_{\mathsf{mix}}}{\mu_{\mathsf{min}}(1-\gamma)}$ up to a logarithmic factor, provided that a suitable constant learning rate is adopted. Here, $t_{\mathsf{mix}}$ and $\mu_{\mathsf{min}}$ denote, respectively, the mixing time and the minimum state-action occupancy probability of the sample trajectory. The first term of this bound matches the sample complexity in the synchronous case with independent samples drawn from the stationary distribution of the trajectory. The second term reflects the cost required for the empirical distribution of the Markovian trajectory to reach its steady state, which is incurred at the very beginning and becomes amortized as the algorithm runs. Encouragingly, this bound improves upon the state-of-the-art result by a factor of at least $|\mathcal{S}||\mathcal{A}|$ in all scenarios, and by a factor of at least $t_{\mathsf{mix}}|\mathcal{S}||\mathcal{A}|$ for any sufficiently small accuracy level $\varepsilon$. Further, we demonstrate that the scaling in the effective horizon $\frac{1}{1-\gamma}$ can be improved by means of variance reduction.
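
To make the setting concrete, below is a minimal Python sketch of classical asynchronous Q-learning with a constant learning rate, run on a single Markovian trajectory generated by a behavior policy. The environment interface (`env.reset`, `env.step`) and the `behavior_policy` callable are illustrative assumptions for this sketch, not part of the paper; the point is only that each step updates the single state-action entry just visited.

```python
import numpy as np

def asynchronous_q_learning(env, behavior_policy, num_states, num_actions,
                            num_steps, gamma=0.99, eta=0.1):
    """Sketch of classical asynchronous Q-learning on one Markovian trajectory.

    Assumed (hypothetical) interface:
      env.reset() -> initial state index
      env.step(a) -> (next_state, reward, done)
      behavior_policy(s) -> action index
    Only the (s, a) entry visited at each step is updated, with a constant
    learning rate eta, matching the setting analyzed in the abstract.
    """
    Q = np.zeros((num_states, num_actions))
    s = env.reset()
    for _ in range(num_steps):
        a = behavior_policy(s)           # action drawn from the behavior policy
        s_next, r, done = env.step(a)    # one Markovian transition
        # Temporal-difference target formed from the current Q estimate.
        target = r + gamma * np.max(Q[s_next])
        # Asynchronous update: only the visited state-action pair changes.
        Q[s, a] = (1 - eta) * Q[s, a] + eta * target
        s = env.reset() if done else s_next
    return Q
```

In this sketch the constant step size `eta` plays the role of the learning rate in the stated bound; in practice it would be chosen as a function of the target accuracy $\varepsilon$, the discount factor $\gamma$, and the trajectory's mixing properties.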
