The convergence of TD(λ) for general λ

The method of temporal differences (TD) is one way of making consistent predictions about the future. This paper uses some analysis of Watkins (1989) to extend a convergence theorem due to Sutton (1988) from the case that uses only information from adjacent time steps to the case that uses information from arbitrary ones. It also considers how this version of TD behaves in the face of linearly dependent representations for states, demonstrating that it still converges, but to a different answer from the least mean squares algorithm. Finally, it adapts Watkins' theorem that Q-learning, his closely related prediction and action learning method, converges with probability one, to demonstrate this strong form of convergence for a slightly modified version of TD.
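The contrast between using only adjacent time steps (λ = 0) and using arbitrary earlier ones (λ > 0) can be illustrated with a minimal sketch of linear TD(λ) over one observation sequence. It is not taken from the paper: the function name `td_lambda_episode`, the choice to update the weights online after every step, and the accumulating trace are assumptions made for illustration.

```python
def td_lambda_episode(w, xs, outcome, alpha, lam):
    """One pass of linear TD(lambda) over a single observation sequence.

    w       -- initial weight vector (list of floats); not modified
    xs      -- list of state feature vectors x_1 .. x_m
    outcome -- the final observed outcome z that the predictions target
    alpha   -- learning rate
    lam     -- the lambda parameter in [0, 1]
    Assumption: weights are updated online, after every time step.
    """
    w = list(w)
    trace = [0.0] * len(w)  # sum over k <= t of lam^(t-k) * x_k
    for t, x in enumerate(xs):
        p_t = sum(wi * xi for wi, xi in zip(w, x))
        if t + 1 < len(xs):
            # Next prediction comes from the next state's features.
            p_next = sum(wi * xi for wi, xi in zip(w, xs[t + 1]))
        else:
            # At the end of the sequence the target is the outcome itself.
            p_next = outcome
        # Decay the old trace and add the current state's features.
        trace = [lam * e + xi for e, xi in zip(trace, x)]
        delta = p_next - p_t  # temporal difference between predictions
        w = [wi + alpha * delta * e for wi, e in zip(w, trace)]
    return w


xs = [[1.0, 0.0], [0.0, 1.0]]  # two states, one-hot features
print(td_lambda_episode([0.0, 0.0], xs, 1.0, 0.5, 0.0))  # [0.0, 0.5]
print(td_lambda_episode([0.0, 0.0], xs, 1.0, 0.5, 1.0))  # [0.5, 0.5]
```

With λ = 0 only the state immediately preceding the outcome is credited; with λ = 1 the earlier state receives the same credit, since its trace has not decayed.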

[1]  Arthur L. Samuel, et al.  Some Studies in Machine Learning Using the Game of Checkers, 1967, IBM J. Res. Dev.

[2]  J. Gillis, et al.  Matrix Iterative Analysis, 1961.

[3]  E. Feigenbaum, et al.  Computers and Thought, 1963.

[4]  P. B. Coaker, et al.  Applied Dynamic Programming, 1964.

[5]  R. Bellman.  Dynamic programming, 1957, Science.

[6]  A. L. Samuel, et al.  Some studies in machine learning using the game of checkers. II: recent progress, 1967.

[7]  A. H. Klopf, et al.  Brain Function and Adaptive Systems: A Heterostatic Theory, 1972.

[8]  James S. Albus, et al.  New Approach to Manipulator Control: The Cerebellar Model Articulation Controller (CMAC), 1975.

[9]  S. Ostrach, et al.  Heat Transfer Augmentation in Laminar Fully Developed Channel Flow by Means of Heating From Below, 1975.

[10]  Ian H. Witten, et al.  An Adaptive Optimal Controller for Discrete-Time Markov Environments, 1977, Inf. Control.

[11]  John S. Edwards, et al.  The Hedonistic Neuron: A Theory of Memory, Learning and Intelligence, 1983.

[12]  Richard S. Sutton, et al.  Neuronlike adaptive elements that can solve difficult learning control problems, 1983, IEEE Transactions on Systems, Man, and Cybernetics.

[13]  Steven Edward Hampson, et al.  A neural model of adaptive behavior, 1983.

[14]  Richard S. Sutton, et al.  Temporal credit assignment in reinforcement learning, 1984.

[15]  John R. Anderson, et al.  Machine learning - an artificial intelligence approach, 1982, Symbolic computation.

[16]  John H. Holland, et al.  Escaping brittleness: the possibilities of general-purpose learning algorithms applied to parallel rule-based systems, 1995.

[17]  S. Thomas Alexander, et al.  Adaptive Signal Processing, 1986, Texts and Monographs in Computer Science.

[18]  Bart W. Stuck, et al.  A Computer and Communication Network Performance Analysis Primer (Prentice Hall, Englewood Cliffs, NJ, 1985; revised, 1987), 1987, Int. CMG Conference.

[19]  Stephen M. Omohundro, et al.  Efficient Algorithms with Neural Network Behavior, 1987, Complex Syst.

[20]  A. Barto, et al.  Learning and Sequential Decision Making, 1989.

[21]  S. Hampson.  Connectionistic Problem Solving: Computational Aspects of Biological Learning, 1990, Trends in Neurosciences.

[22]  Andrew W. Moore, et al.  Efficient memory-based learning for robot control, 1990.

[23]  Paul J. Werbos, et al.  Consistency of HDP applied to a simple reinforcement learning problem, 1990, Neural Networks.

[24]  M. Gabriel, et al.  Learning and Computational Neuroscience: Foundations of Adaptive Networks, 1990.

[25]  P. Dayan.  Reinforcing connectionism: learning the statistical way, 1991.

[26]  Ben J. A. Kröse, et al.  Learning from delayed rewards, 1995, Robotics Auton. Syst.

[27]  Richard S. Sutton, et al.  Learning to predict by the methods of temporal differences, 1988, Machine Learning.