Finite-Sample Analysis for SARSA and Q-Learning with Linear Function Approximation

Although the asymptotic convergence of major reinforcement learning algorithms has been extensively studied, finite-sample analysis, which further characterizes the convergence rate in terms of sample complexity, remains very limited for problems with continuous state spaces. Such analysis is especially challenging for algorithms whose learning policies change dynamically and whose data are sampled in a non-i.i.d. fashion. In this paper, we present the first finite-sample analysis of the SARSA algorithm and its minimax variant (for zero-sum Markov games), with a single sample path and linear function approximation. To establish these results, we develop a novel technique for bounding the gradient bias under dynamically changing learning policies, which may be of independent interest. We further provide finite-sample bounds for Q-learning and its minimax variant. Comparing our results with existing finite-sample bounds shows that linear function approximation achieves order-level lower sample complexity than the nearest neighbor approach.
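For concreteness, the following is a minimal sketch of the kind of algorithm studied here: on-policy SARSA with linear function approximation, run on a single sample path under a policy that depends on the current parameter. The environment interface (`env.reset`, `env.step`), the feature map `phi`, the softmax behavior policy, and all hyperparameters are illustrative assumptions, not details taken from the paper.

```python
import numpy as np

def softmax_policy(theta, phi, state, actions, temperature=1.0):
    """Sample an action index from a softmax over linear Q-values Q(s,a) = theta . phi(s,a)."""
    prefs = np.array([theta @ phi(state, a) for a in actions]) / temperature
    prefs -= prefs.max()                      # numerical stability
    probs = np.exp(prefs) / np.exp(prefs).sum()
    return np.random.choice(len(actions), p=probs)

def sarsa_linear(env, phi, actions, num_steps=10_000,
                 gamma=0.9, step_size=0.01, radius=10.0):
    """SARSA on a single trajectory with Q(s,a) approximated by theta . phi(s,a).

    Iterates are projected onto an l2-ball of the given radius, mirroring the
    projected updates commonly used in finite-sample analyses of this kind.
    """
    d = phi(env.reset(), actions[0]).shape[0]   # feature dimension (assumed fixed)
    theta = np.zeros(d)
    state = env.reset()
    a_idx = softmax_policy(theta, phi, state, actions)
    for _ in range(num_steps):
        next_state, reward, done = env.step(actions[a_idx])
        next_a_idx = softmax_policy(theta, phi, next_state, actions)
        # Temporal-difference error under the current (theta-dependent) policy.
        td_error = (reward
                    + gamma * theta @ phi(next_state, actions[next_a_idx])
                    - theta @ phi(state, actions[a_idx]))
        theta = theta + step_size * td_error * phi(state, actions[a_idx])
        # Project back onto the l2-ball to keep iterates bounded.
        norm = np.linalg.norm(theta)
        if norm > radius:
            theta *= radius / norm
        if done:
            state = env.reset()
            a_idx = softmax_policy(theta, phi, state, actions)
        else:
            state, a_idx = next_state, next_a_idx
    return theta
```

The key feature this sketch highlights is that the behavior policy changes with every update of `theta`, which is precisely what makes bounding the gradient bias nontrivial in the analysis.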
