Practical Issues in Temporal Difference Learning

This paper examines whether temporal difference methods for training connectionist networks, such as Sutton's TD(λ) algorithm, can be successfully applied to complex real-world problems. A number of important practical issues are identified and discussed from a general theoretical perspective. These practical issues are then examined in the context of a case study in which TD(λ) is applied to learning the game of backgammon from the outcome of self-play. This is apparently the first application of the algorithm to a complex, nontrivial task. It is found that, with zero knowledge built in, the network is able to learn from scratch to play the entire game at a fairly strong intermediate level of performance, which is clearly better than conventional commercial programs and which in fact surpasses comparable networks trained on a massive human expert data set. The hidden units in these networks have apparently discovered useful features, a longstanding goal of computer games research. Furthermore, when a set of hand-crafted features is added to the input representation, the resulting networks reach a near-expert level of performance and have achieved good results against world-class human play.
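For reference, the core of the TD(λ) procedure discussed above is the eligibility-trace weight update. The sketch below illustrates it for a linear value function over one self-play episode; this is an illustrative reconstruction following Sutton's general formulation, not the paper's actual backgammon network (which backpropagates the same TD error through a multilayer net), and the function name, random-walk-style episode interface, and parameter values are assumptions for the example.

```python
# Minimal sketch of a TD(lambda) update with an accumulating eligibility trace,
# assuming a linear value function w @ x; names and defaults are illustrative.
import numpy as np

def td_lambda_episode(features, outcome, w, alpha=0.1, lam=0.7):
    """Update weights w over one episode.

    features : list of feature vectors x_1 .. x_T observed during the episode
    outcome  : terminal reward z (e.g. 1 for a win, 0 for a loss) after x_T
    """
    e = np.zeros_like(w)                       # eligibility trace
    for t in range(len(features)):
        x_t = features[t]
        p_t = w @ x_t                          # current prediction P_t
        # the "next prediction" is the terminal outcome on the final step
        p_next = outcome if t == len(features) - 1 else w @ features[t + 1]
        e = lam * e + x_t                      # decay trace, add gradient of P_t
        w += alpha * (p_next - p_t) * e        # TD error scales the trace
    return w
```

In the self-play setting described in the paper, each episode's features would be the board positions the network evaluates during a game, and the terminal outcome is the result of that game.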
