Least Squares Policy Evaluation Algorithms with Linear Function Approximation

We consider policy evaluation algorithms in the context of infinite-horizon dynamic programming problems with discounted cost. We focus on discrete-time dynamic systems with a large number of states and discuss two methods that use simulation, temporal differences, and linear cost-function approximation. The first method is a new gradient-like algorithm involving least-squares subproblems and a diminishing stepsize, based on the λ-policy iteration method of Bertsekas and Ioffe. The second is the LSTD(λ) algorithm recently proposed by Boyan, which for λ = 0 coincides with the linear least-squares temporal-difference algorithm of Bradtke and Barto. At present, the only available convergence result, due to Bradtke and Barto, covers the LSTD(0) case. We strengthen this result by showing that LSTD(λ) converges with probability 1 for every λ ∈ [0, 1].
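As a concrete illustration of the second method, below is a minimal single-pass sketch of LSTD(λ) along the lines summarized above: an eligibility trace is decayed by the product of the discount factor and λ, a matrix A and a vector b are accumulated over the simulated transitions, and the weight vector is obtained by solving the resulting linear system. This is a sketch under stated assumptions, not the paper's implementation: the names lstd_lambda, phi, and trajectory, the (state, cost, next_state) data layout, and the small ridge term reg are all illustrative choices introduced here.

```python
import numpy as np

def lstd_lambda(trajectory, phi, gamma=0.95, lam=0.5, reg=1e-8):
    """Single-pass LSTD(lambda) estimate of a weight vector r such that
    phi(s)^T r approximates the discounted cost-to-go of the fixed policy.

    trajectory -- sequence of (state, cost, next_state) transitions
                  generated by simulating the policy
    phi        -- feature map: state -> 1-D numpy array
    gamma      -- discount factor in (0, 1)
    lam        -- trace-decay parameter lambda in [0, 1]
    reg        -- small ridge term (an assumption added here so the
                  system is solvable even on short trajectories; it is
                  not part of the algorithm as analyzed)
    """
    k = phi(trajectory[0][0]).shape[0]
    A = np.zeros((k, k))
    b = np.zeros(k)
    z = np.zeros(k)                       # eligibility trace
    for s, cost, s_next in trajectory:
        f, f_next = phi(s), phi(s_next)
        z = gamma * lam * z + f           # decay the trace, add current features
        A += np.outer(z, f - gamma * f_next)
        b += z * cost
    # Solve A r = b; the ridge term guards against a singular A.
    return np.linalg.solve(A + reg * np.eye(k), b)
```

For λ = 0 the trace reduces to the current feature vector, and the accumulation reduces to the linear least-squares temporal-difference algorithm of Bradtke and Barto, matching the coincidence noted above.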

References

[1] Emanuel Parzen, et al. Modern Probability Theory and Its Applications, 1962.

[2] L. Sucheston. Modern Probability Theory and Its Applications, 1961.

[3] J. Neveu, et al. Discrete Parameter Martingales, 1975.

[4] Michael I. Jordan, et al. Massachusetts Institute of Technology Artificial Intelligence Laboratory and Center for Biological and Computational Learning, Department of Brain and Cognitive Sciences, 1996.

[5] Dimitri P. Bertsekas, et al. Nonlinear Programming, 1997.

[6] Robert G. Gallager, et al. Discrete Stochastic Processes, 1995.

[7] Dimitri P. Bertsekas, et al. A Counterexample to Temporal Differences Learning. Neural Computation, 1995.

[8] Dimitri P. Bertsekas, et al. Dynamic Programming and Optimal Control, Two Volume Set, 1995.

[9] John N. Tsitsiklis, et al. Neuro-Dynamic Programming, 1996.

[10] S. Ioffe, et al. Temporal Differences-Based Policy Iteration and Applications in Neuro-Dynamic Programming, 1996.

[11] John N. Tsitsiklis, et al. Analysis of Temporal-Difference Learning with Function Approximation. NIPS, 1996.

[12] Dimitri P. Bertsekas, et al. Temporal Differences-Based Policy Iteration and Applications in Neuro-Dynamic Programming, 1997.

[13] John N. Tsitsiklis, et al. Gradient Convergence in Gradient Methods with Errors. SIAM J. Optim., 1999.

[14] Michael I. Jordan, et al. On the Convergence of Temporal-Difference Learning with Linear Function Approximation, 2001.

[15] R. M. Dudley. Real Analysis and Probability: Measurability: Borel Isomorphism and Analytic Sets, 2002.

[16] Steven J. Bradtke, et al. Linear Least-Squares Algorithms for Temporal Difference Learning. Machine Learning, 1996.

[17] Terrence J. Sejnowski, et al. TD(λ) Converges with Probability 1. Machine Learning, 1994.

[18] Vladislav Tadic, et al. On the Convergence of Temporal-Difference Learning with Linear Function Approximation. Machine Learning, 2001.

[19] Justin A. Boyan, et al. Technical Update: Least-Squares Temporal Difference Learning. Machine Learning, 2002.

[20] Richard S. Sutton, et al. Learning to Predict by the Methods of Temporal Differences. Machine Learning, 1988.

[21] B. Nordstrom. Finite Markov Chains, 2005.