Temporal Differences-Based Policy Iteration and Applications in Neuro-Dynamic Programming

We introduce a new policy iteration method for dynamic programming problems with discounted and undiscounted cost. The method is based on the notion of temporal differences and is primarily geared to large and complex problems where the use of approximations is essential. We develop the theory of the method without approximation, describe how to embed it within a neuro-dynamic programming/reinforcement learning context where feature-based approximation architectures are used, relate it to TD(λ) methods, and illustrate its use in the training of a Tetris-playing program.

Supported by the National Science Foundation under Grant DDM-8903385 and Grant CCR-9103804. Thanks are due to John Tsitsiklis for several helpful discussions and to Dimitris Papaioannou, who assisted with some of the experiments.

Department of Electrical Engineering and Computer Science, M.I.T., Cambridge, Mass., 02139.
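The paper's method (temporal-differences-based policy iteration) is developed formally in the text itself; as a rough illustration of the ingredients the abstract names, temporal differences and a feature-based (linear) approximation architecture, here is a minimal sketch of TD(λ) policy evaluation. The environment interface (`env.reset`, `env.step`), the feature map `phi`, the `policy` function, and all parameter values are hypothetical placeholders, not taken from the paper.

```python
import numpy as np

def td_lambda_linear(env, phi, num_features, policy,
                     episodes=100, alpha=0.05, gamma=0.95, lam=0.7):
    """TD(lambda) policy evaluation with a linear, feature-based
    approximation V(s) ~ w . phi(s).

    Illustrative sketch only: `env`, `phi`, and `policy` are assumed
    interfaces, and the step sizes/discount factor are arbitrary.
    """
    w = np.zeros(num_features)
    for _ in range(episodes):
        s = env.reset()
        z = np.zeros(num_features)          # eligibility trace
        done = False
        while not done:
            a = policy(s)
            s_next, cost, done = env.step(a)
            # Temporal difference: one-stage cost plus discounted
            # next-state estimate minus current estimate.
            v = w @ phi(s)
            v_next = 0.0 if done else w @ phi(s_next)
            delta = cost + gamma * v_next - v
            # Accumulating eligibility trace and gradient-style update.
            z = gamma * lam * z + phi(s)
            w += alpha * delta * z
            s = s_next
    return w
```

Setting lam=0 recovers one-step TD(0), while lam=1 approaches Monte Carlo evaluation; intermediate values trade off bias and variance in the usual way.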
