Paper information - A Counterexample to Temporal Differences Learning

A Counterexample to Temporal Differences Learning

Sutton's TD(λ) method aims to provide a representation of the cost function in an absorbing Markov chain with transition costs. A simple example is given where the representation obtained depends on λ. For λ = 1 the representation is optimal with respect to a least-squares error criterion, but as λ decreases toward 0 the representation becomes progressively worse and, in some cases, very poor. The example suggests a need to understand better the circumstances under which TD(0) and Q-learning obtain satisfactory neural network-based compact representations of the cost function. A variation of TD(0) is also given, which performs better on the example.
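The dependence on λ described in the abstract can be illustrated with a small sketch. Everything specific here is an assumption made for illustration and is not taken from the paper: a deterministic chain n → n-1 → ... → 0 with absorbing state 0, a single-weight linear representation r·i of the cost-to-go of state i, and hypothetical transition costs of -(n-1) out of state n and 1 elsewhere. The helper `td_lambda_fixed_point` is likewise a hypothetical name. It computes the weight at which the expected TD(λ) update over one trajectory vanishes, which is one common way to characterize the limit of TD(λ) with linear approximation.

```python
def td_lambda_fixed_point(features, costs, lam):
    """Weight r at which the summed TD(lambda) update over one trajectory is zero.

    features[t] is the scalar feature of the t-th visited state (the absorbing
    state has feature 0) and costs[t] is the cost of the t-th transition.
    Undiscounted, single-weight linear approximation (assumed setup).
    """
    z = 0.0  # eligibility trace
    a = 0.0  # accumulates z_t * (phi_t - phi_{t+1})
    b = 0.0  # accumulates z_t * g_t
    for t, phi_t in enumerate(features):
        phi_next = features[t + 1] if t + 1 < len(features) else 0.0
        z = lam * z + phi_t
        a += z * (phi_t - phi_next)
        b += z * costs[t]
    return b / a


def squared_error(r, features, cost_to_go):
    """Sum of squared differences between r * phi(i) and the true cost-to-go."""
    return sum((r * phi - j) ** 2 for phi, j in zip(features, cost_to_go))


# Hypothetical chain: start at state n, step down by one until absorption at 0.
n = 10
features = [float(i) for i in range(n, 0, -1)]   # phi(i) = i
costs = [-(n - 1.0)] + [1.0] * (n - 1)           # assumed transition costs

# True cost-to-go of each visited state (suffix sums of the costs).
cost_to_go = []
running = 0.0
for g in reversed(costs):
    running += g
    cost_to_go.append(running)
cost_to_go.reverse()

# Least-squares optimal weight for comparison.
r_ls = sum(p * j for p, j in zip(features, cost_to_go)) / sum(p * p for p in features)

results = {lam: td_lambda_fixed_point(features, costs, lam)
           for lam in (0.0, 0.5, 0.9, 1.0)}
errors = {lam: squared_error(r, features, cost_to_go) for lam, r in results.items()}

for lam in sorted(results):
    print(f"lambda={lam:.1f}  r={results[lam]:+.4f}  squared error={errors[lam]:.3f}")
print(f"least-squares r={r_ls:+.4f}")
```

Under these assumed costs the λ = 1 weight coincides with the least-squares weight, while the λ = 0 weight is negative even though every true cost-to-go is non-negative, which is the kind of degradation the abstract describes.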

Dimitri P. Bertsekas
