Least Squares Temporal Difference Methods: An Analysis under General Conditions

We consider approximate policy evaluation for finite state and action Markov decision processes (MDPs) with the least squares temporal difference (LSTD) algorithm, LSTD(λ), in an exploration-enhanced learning context, where policy costs are computed from observations of a Markov chain that differs from the one induced by the policy under evaluation. For the discounted cost criterion, we establish that LSTD(λ) converges almost surely under mild, minimal conditions. We also analyze other properties of the iterates produced by the algorithm, including convergence in mean and boundedness. Our analysis draws on the theories of both finite-space Markov chains and weak Feller Markov chains on a topological space. The results can be applied to other temporal difference algorithms and MDP models. As examples, we give a convergence analysis of a TD(λ) algorithm and extensions to MDPs with compact state and action spaces, as well as a convergence proof of a new LSTD algorithm with state-dependent λ-parameters.
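To make the setting concrete, the following is a minimal sketch of an exploration-enhanced (off-policy) LSTD(λ) estimate with linear features, written in Python. The names phi, cost, rho, beta, and lam are illustrative assumptions rather than the paper's notation, and the recursion shown is a generic variant in which per-step importance ratios correct for the mismatch between the behavior chain and the chain induced by the policy under evaluation; it is a sketch under these assumptions, not the paper's exact algorithm.

    # A minimal sketch of off-policy LSTD(lambda) with linear features, assuming:
    #   phi(s)          -> d-dimensional feature vector of state s
    #   cost(s, s_next) -> per-stage cost observed along the behavior chain
    #   rho(s, s_next)  -> importance ratio: target transition prob / behavior transition prob
    #   beta in (0, 1)  -> discount factor,  lam in [0, 1] -> trace parameter lambda
    # All names are illustrative; this is not the paper's exact formulation.
    import numpy as np

    def lstd_lambda(trajectory, phi, cost, rho, beta=0.95, lam=0.5):
        """Estimate weights theta with cost-to-go(s) ~ phi(s)^T theta
        from a single trajectory s_0, ..., s_T of the behavior chain."""
        d = len(phi(trajectory[0]))
        A = np.zeros((d, d))
        b = np.zeros(d)
        z = np.zeros(d)          # eligibility trace
        prev_rho = 1.0
        for t in range(len(trajectory) - 1):
            s, s_next = trajectory[t], trajectory[t + 1]
            r = rho(s, s_next)
            # trace carries discounted, lambda-weighted, ratio-corrected history
            z = lam * beta * prev_rho * z + phi(s)
            # running (unnormalized) estimates of A and b in A theta + b = 0
            A += np.outer(z, r * beta * phi(s_next) - phi(s))
            b += z * r * cost(s, s_next)
            prev_rho = r
        # solve A theta = -b; least squares guards against a singular early estimate
        theta, *_ = np.linalg.lstsq(A, -b, rcond=None)
        return theta

In the on-policy special case, rho is identically 1 and the recursion reduces to the classical LSTD(λ) estimates.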
