Finite-Sample Analysis for SARSA and Q-Learning with Linear Function Approximation

Although the asymptotic convergence of major reinforcement learning algorithms has been extensively studied, finite-sample analysis, which further characterizes the convergence rate in terms of sample complexity, remains very limited for problems with continuous state spaces. Such analysis is especially challenging for algorithms whose learning policies change dynamically and whose data are sampled in a non-i.i.d. fashion. In this paper, we present the first finite-sample analysis of the SARSA algorithm and its minimax variant (for zero-sum Markov games), with a single sample path and linear function approximation. To establish these results, we develop a novel technique for bounding the gradient bias under dynamically changing learning policies, which may be of independent interest. We further provide finite-sample bounds for Q-learning and its minimax variant. Comparing our results with existing finite-sample bounds shows that linear function approximation achieves order-level lower sample complexity than the nearest neighbor approach.
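For concreteness, the following is a minimal sketch of the kind of algorithm studied here: on-policy SARSA with linear function approximation, run on a single sample path under a policy that depends on the current parameter. The environment interface (`env.reset`, `env.step`), the feature map `phi`, the softmax behavior policy, and all hyperparameters are illustrative assumptions, not details taken from the paper.

```python
import numpy as np

def softmax_policy(theta, phi, state, actions, temperature=1.0):
    """Sample an action index from a softmax over linear Q-values Q(s,a) = theta . phi(s,a)."""
    prefs = np.array([theta @ phi(state, a) for a in actions]) / temperature
    prefs -= prefs.max()                      # numerical stability
    probs = np.exp(prefs) / np.exp(prefs).sum()
    return np.random.choice(len(actions), p=probs)

def sarsa_linear(env, phi, actions, num_steps=10_000,
                 gamma=0.9, step_size=0.01, radius=10.0):
    """SARSA on a single trajectory with Q(s,a) approximated by theta . phi(s,a).

    Iterates are projected onto an l2-ball of the given radius, mirroring the
    projected updates commonly used in finite-sample analyses of this kind.
    """
    d = phi(env.reset(), actions[0]).shape[0]   # feature dimension (assumed fixed)
    theta = np.zeros(d)
    state = env.reset()
    a_idx = softmax_policy(theta, phi, state, actions)
    for _ in range(num_steps):
        next_state, reward, done = env.step(actions[a_idx])
        next_a_idx = softmax_policy(theta, phi, next_state, actions)
        # Temporal-difference error under the current (theta-dependent) policy.
        td_error = (reward
                    + gamma * theta @ phi(next_state, actions[next_a_idx])
                    - theta @ phi(state, actions[a_idx]))
        theta = theta + step_size * td_error * phi(state, actions[a_idx])
        # Project back onto the l2-ball to keep iterates bounded.
        norm = np.linalg.norm(theta)
        if norm > radius:
            theta *= radius / norm
        if done:
            state = env.reset()
            a_idx = softmax_policy(theta, phi, state, actions)
        else:
            state, a_idx = next_state, next_a_idx
    return theta
```

The key feature this sketch highlights is that the behavior policy changes with every update of `theta`, which is precisely what makes bounding the gradient bias nontrivial in the analysis.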
