Convergence Results for Some Temporal Difference Methods Based on Least Squares

We consider finite-state Markov decision processes, and prove convergence and rate of convergence results for certain least squares policy evaluation algorithms of the type known as LSPE(λ). These are temporal difference methods for constructing a linear function approximation of the cost function of a stationary policy, within the context of infinite-horizon discounted and average cost dynamic programming. We introduce an average cost method, patterned after the known discounted cost method, and we prove its convergence for a range of constant stepsize choices. We also show that the convergence rate of both the discounted and the average cost methods is optimal within the class of temporal difference methods. Analysis and experiment indicate that our methods are substantially and often dramatically faster than TD(λ), as well as more reliable.
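As a rough illustration of the kind of method analyzed here, the sketch below implements a basic discounted-cost LSPE(λ) iteration in Python with NumPy: along a single simulated trajectory it accumulates the least squares quantities B, A, b and applies the constant-stepsize update r ← r + γ B⁻¹(A r + b). The toy Markov reward process, feature matrix, parameter values (alpha, lam, gamma), and the small ridge term are hypothetical choices for this example, not taken from the paper; the paper's average cost variant additionally maintains a running average cost estimate inside the temporal differences.

```python
import numpy as np

def lspe_lambda(P, g, phi, alpha=0.95, lam=0.7, gamma=1.0,
                n_steps=20000, seed=0):
    """Sketch of simulation-based LSPE(lambda) policy evaluation:
    approximate the discounted cost vector J by phi @ r.

    Accumulates B = sum phi(i)phi(i)', A = sum z (alpha*phi(j) - phi(i))',
    b = sum z g(i,j) along one trajectory, then applies the
    constant-stepsize update  r <- r + gamma * B^{-1} (A r + b).
    """
    rng = np.random.default_rng(seed)
    n, d = phi.shape
    r = np.zeros(d)
    B = 1e-3 * np.eye(d)   # small ridge so B is invertible from the start
    A = np.zeros((d, d))
    b = np.zeros(d)
    z = np.zeros(d)        # eligibility vector: z_k = (alpha*lam) z_{k-1} + phi(i_k)

    i = int(rng.integers(n))
    for _ in range(n_steps):
        j = int(rng.choice(n, p=P[i]))        # one simulated transition i -> j
        z = alpha * lam * z + phi[i]
        B += np.outer(phi[i], phi[i])
        A += np.outer(z, alpha * phi[j] - phi[i])
        b += z * g[i, j]
        r = r + gamma * np.linalg.solve(B, A @ r + b)
        i = j
    return r

# Toy check on a random 10-state chain with 3 features per state.
rng = np.random.default_rng(1)
n, d = 10, 3
P = rng.random((n, n)); P /= P.sum(axis=1, keepdims=True)
g = rng.random((n, n))
phi = rng.random((n, d))

r = lspe_lambda(P, g, phi)
J_exact = np.linalg.solve(np.eye(n) - 0.95 * P, (P * g).sum(axis=1))
print(np.column_stack([phi @ r, J_exact]))  # approximation vs. exact costs
```

Note that the sketch uses the unit stepsize γ = 1, one of the constant stepsizes for which convergence is established in the paper; the approximation phi @ r matches the exact cost vector only up to the error of the chosen feature subspace.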
