TD-Gammon: A Self-Teaching Backgammon Program

This chapter describes TD-Gammon, a neural network that is able to teach itself to play backgammon solely by playing against itself and learning from the results. TD-Gammon uses a recently proposed reinforcement learning algorithm called TD(λ) (Sutton, 1988), and is apparently the first application of this algorithm to a complex, nontrivial task. Despite starting from random initial weights (and hence a random initial strategy), TD-Gammon achieves a surprisingly strong level of play. With zero knowledge built in at the start of learning (i.e., given only a “raw” description of the board state), the network learns to play the entire game at a strong intermediate level that surpasses not only conventional commercial programs but also comparable networks trained via supervised learning on a large corpus of human expert games. The hidden units in the network have apparently discovered useful features, a longstanding goal of computer games research.
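The abstract names TD(λ) but does not state its update. In Sutton's formulation the weights move by α(Y_{t+1} − Y_t)e_t, where Y_t is the prediction at step t, the last "prediction" is the actual outcome, and the eligibility trace is e_t = λe_{t−1} + ∇_w Y_t. The Python below is a minimal sketch of that rule under stated assumptions. It does not play backgammon. Instead it uses a 5-state bounded random walk, similar to the toy task in Sutton (1988). Like TD-Gammon, it gets no reward until a terminal outcome: 1 at the right end and 0 at the left. Three choices here are hypothetical illustrations and not taken from the chapter:

- the function name `td_lambda_random_walk`;
- the linear value function over one-hot states;
- the decaying step-size schedule.

In a multilayer network like TD-Gammon's, the line `e[s] += 1.0` would instead add the backpropagated gradient of the output with respect to every weight.

```python
import random


def td_lambda_random_walk(episodes=20000, lam=0.8, alpha0=0.1, seed=0):
    """Estimate the values of a 5-state bounded random walk with TD(lambda).

    Hypothetical toy stand-in for TD-Gammon's setting: no intermediate rewards,
    and a single terminal outcome of 1 (right end) or 0 (left end).
    """
    rng = random.Random(seed)
    n_states = 5
    # Linear value function on one-hot state features: V(s) = w[s].
    w = [0.5] * n_states
    for ep in range(episodes):
        # Decaying step size (illustrative assumption, not from the chapter).
        alpha = alpha0 / (1 + ep / 100)
        e = [0.0] * n_states  # eligibility traces, reset every episode
        s = n_states // 2     # each episode starts in the middle state
        while True:
            # e_t = lambda * e_{t-1} + grad_w V(s_t); the gradient is one-hot at s.
            for i in range(n_states):
                e[i] *= lam
            e[s] += 1.0
            s_next = s + rng.choice((-1, 1))
            terminal = s_next < 0 or s_next >= n_states
            if terminal:
                target = 1.0 if s_next >= n_states else 0.0  # final outcome
            else:
                target = w[s_next]  # bootstrap from the next prediction
            delta = target - w[s]  # TD error: Y_{t+1} - Y_t
            for i in range(n_states):
                w[i] += alpha * delta * e[i]
            if terminal:
                break
            s = s_next
    return w


if __name__ == "__main__":
    print([round(v, 2) for v in td_lambda_random_walk()])
```

The estimates should settle near the true probabilities of ending on the right: 1/6, 2/6, ..., 5/6. The `lam` parameter controls how much each update relies on intermediate predictions:

- `lam=0` gives one-step TD(0), which bootstraps only from the next prediction.
- Values near 1 make each update lean more heavily on the eventual outcome.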

Gerald Tesauro
