Temporal Differences-Based Policy Iteration and Applications in Neuro-Dynamic Programming

We introduce a new policy iteration method for dynamic programming problems with discounted and undiscounted cost. The method is based on the notion of temporal differences and is primarily geared to large and complex problems where the use of approximations is essential. We develop the theory of the method without approximation, describe how to embed it within a neuro-dynamic programming/reinforcement learning context where feature-based approximation architectures are used, relate it to TD(λ) methods, and illustrate its use in training a Tetris-playing program.
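To make the setting concrete, the sketch below shows a generic form of approximate policy iteration in which each policy-evaluation step uses TD(λ) with a linear, feature-based architecture, V(s) ≈ φ(s)·w. This is an illustrative sketch only, not the paper's specific algorithm; the environment interface (`env.reset`, `env.step`, `env.actions`, `env.model`), the feature map `phi`, and all step-size and discount parameters are hypothetical placeholders.

```python
import numpy as np

def td_lambda_evaluate(env, policy, phi, n_features,
                       gamma=0.95, lam=0.7, alpha=0.01, episodes=200):
    """Estimate weights w so that phi(s).dot(w) approximates the
    cost-to-go of the given policy (illustrative TD(lambda) evaluation)."""
    w = np.zeros(n_features)
    for _ in range(episodes):
        s = env.reset()
        z = np.zeros(n_features)              # eligibility trace
        done = False
        while not done:
            a = policy(s)
            s_next, cost, done = env.step(a)  # hypothetical interface
            v_s = phi(s).dot(w)
            v_next = 0.0 if done else phi(s_next).dot(w)
            delta = cost + gamma * v_next - v_s   # temporal difference
            z = gamma * lam * z + phi(s)          # accumulate trace
            w += alpha * delta * z                # TD(lambda) update
            s = s_next
    return w

def greedy_policy_from(w, env, phi, gamma=0.95):
    """Policy improvement step: choose the action minimizing one-step cost
    plus discounted approximate cost-to-go (assumes a one-step model)."""
    def policy(s):
        best_a, best_q = None, float("inf")
        for a in env.actions(s):
            s_next, cost = env.model(s, a)    # hypothetical one-step model
            q = cost + gamma * phi(s_next).dot(w)
            if q < best_q:
                best_a, best_q = a, q
        return best_a
    return policy
```

Alternating `td_lambda_evaluate` and `greedy_policy_from` yields an approximate policy iteration loop of the kind the abstract describes, with the feature map φ playing the role of the feature-based approximation architecture.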
