A Multistep Lyapunov Approach for Finite-Time Analysis of Biased Stochastic Approximation

Motivated by the widespread use of temporal-difference (TD) and Q-learning algorithms in reinforcement learning, this paper studies a class of biased stochastic approximation (SA) procedures under a mild "ergodic-like" assumption on the underlying stochastic noise sequence. Building on a carefully designed multistep Lyapunov function that looks ahead several future updates to accommodate the stochastic perturbations (and thereby control the gradient bias), we prove a general convergence result for the iterates and use it to derive non-asymptotic bounds on the mean-square error under constant stepsizes. This looking-ahead viewpoint makes finite-time analysis of biased SA algorithms possible under a large family of stochastic perturbations. For direct comparison with existing contributions, we also instantiate these bounds for TD- and Q-learning with linear function approximation under the practical Markov chain observation model. The resulting finite-time error bounds for the TD- and Q-learning algorithms are the first of their kind, in the sense that they hold i) for the unmodified algorithms (i.e., without any changes to the parameter updates), even with nonlinear function approximators, and for Markov chains ii) under general mixing conditions and iii) started from any initial distribution, whereas existing results require at least one of these conditions to fail.
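To fix ideas, a minimal sketch of the setting is the generic biased SA recursion together with one natural multistep Lyapunov candidate. The notation here is assumed for illustration and need not match the paper's exact construction: $h$ denotes the mean drift, $\xi_k$ the (possibly Markovian, hence biased) noise, $\theta^\star$ the target fixed point, and $T$ a look-ahead window on the order of the mixing time:

$$
\theta_{k+1} \;=\; \theta_k + \alpha\,\bigl[h(\theta_k) + \xi_k\bigr],
\qquad
W(\theta_k) \;:=\; \sum_{t=k}^{k+T-1} \bigl\|\theta_t - \theta^\star\bigr\|^2 .
$$

Because $W$ aggregates $T$ consecutive iterates, the per-step bias injected by a slowly mixing noise sequence can be averaged out across the window, yielding a negative drift roughly of the form $\mathbb{E}\,[\,W(\theta_{k+1}) - W(\theta_k)\,] \le -c\,\alpha\,W(\theta_k) + O(\alpha^2)$, from which mean-square error bounds under constant stepsizes follow.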
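As a concrete instance of this setting, the following minimal Python sketch runs TD(0) with linear function approximation along a single Markov-chain trajectory with a constant stepsize. The chain, rewards, features, and all numerical constants are made-up assumptions for illustration, not the paper's experiments:

```python
import numpy as np

# Minimal sketch: TD(0) with linear function approximation, driven by one
# Markov-chain sample path (the biased-SA setting analyzed in the paper).
# The chain P, rewards r, features Phi, and constants below are illustrative
# assumptions, not taken from the paper.

rng = np.random.default_rng(0)

n_states, d = 10, 3      # number of states and feature dimension (assumed)
gamma = 0.9              # discount factor (assumed)
alpha = 0.05             # constant stepsize, as in the paper's bounds
K = 5000                 # number of SA iterations (assumed)

# Random ergodic chain, rewards, and feature matrix (illustrative).
P = rng.random((n_states, n_states))
P /= P.sum(axis=1, keepdims=True)
r = rng.random(n_states)
Phi = rng.standard_normal((n_states, d))

theta = np.zeros(d)
s = rng.integers(n_states)   # any initial distribution is admissible

for k in range(K):
    s_next = rng.choice(n_states, p=P[s])
    # Semi-gradient TD(0) update. With correlated Markovian samples this
    # update direction is a biased estimate of the mean-path drift, which
    # is precisely what the multistep Lyapunov analysis accommodates.
    td_error = r[s] + gamma * Phi[s_next] @ theta - Phi[s] @ theta
    theta += alpha * td_error * Phi[s]
    s = s_next

print("final parameters:", theta)
```

Note that no projection, averaging, or update modification is applied anywhere in the loop: the point of the analysis is that finite-time bounds hold for this unmodified recursion.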
