TD Learning of Game Evaluation Functions with Hierarchical Neural Architectures

This Master's thesis describes the efficiency of temporal difference (TD) learning and the advantages of using modular neural network architectures for learning game evaluation functions. These modular architectures use a hierarchy of gating networks to divide the input space into subspaces for which expert networks are trained. This divide-and-conquer principle can be advantageous when learning game evaluation functions that contain discontinuities, and it can also lead to more understandable solutions in which strategies can be identified and explored. We compare three modular architectures: the hierarchical mixtures of experts, the Meta-Pi network, and the use of fixed symbolic rules. To generate learning samples, we combine reinforcement learning with the temporal difference method; by training neural networks on these samples, it is possible to learn to play any desired game. An extension of normal back-propagation has been used in which the sensitivities of neurons are adapted by a learning rule, and we discuss how these neuron sensitivities can be used to learn both discontinuous and smooth game evaluation functions. Experiments with the game of tic-tac-toe and the endgame of backgammon have been performed to compare the hierarchical architectures with a single network and to validate the efficiency of TD learning. The results for tic-tac-toe show that modular architectures learn faster because independent expert networks can be invoked to evaluate a particular position, without the need to invoke one large neural network every time. Furthermore, high neuron sensitivities prove useful when learning discontinuous functions. The results for the endgame of backgammon show that TD learning is a viable alternative to supervised learning when only a small learning set is available. For both games, the performance of the architectures improves as more games are played; high performance levels can be obtained when a large number of games are played and the input of the networks contains sufficient features.

Keywords: Game Playing, Modular Neural Networks, Extended Back-propagation, Temporal Difference Learning, Expert Systems, Multi-strategy Learning, Mixtures of Experts, Meta-Pi Network, Tic-tac-toe, Backgammon
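The core idea of generating learning samples with the temporal difference method can be illustrated with a minimal sketch. This is not the thesis code: it shows a plain TD(0) update for a linear evaluation function over hypothetical binary board features, and all names (`NUM_FEATURES`, `ALPHA`, `GAMMA`, `td_update`) are illustrative assumptions rather than details taken from the thesis, which uses neural networks and TD(λ).

```python
# Minimal TD(0) sketch (assumed setup, not the thesis implementation):
# a linear evaluation function V(s) = w . phi(s) over binary board
# features, updated from successive positions in a played game.

NUM_FEATURES = 9   # e.g. one indicator feature per tic-tac-toe square
ALPHA = 0.1        # learning rate (illustrative value)
GAMMA = 1.0        # no discounting within an episode

weights = [0.0] * NUM_FEATURES

def evaluate(features):
    """Linear evaluation: V(s) = sum_i w_i * phi_i(s)."""
    return sum(w * f for w, f in zip(weights, features))

def td_update(features, next_features, reward, terminal):
    """One TD(0) step: w += alpha * (r + gamma * V(s') - V(s)) * phi(s).

    For a terminal position the target is the final game reward itself,
    so successive position evaluations are pulled toward the outcome.
    """
    target = reward if terminal else reward + GAMMA * evaluate(next_features)
    delta = target - evaluate(features)
    for i, f in enumerate(features):
        weights[i] += ALPHA * delta * f
    return delta
```

In the thesis setting the linear function would be replaced by a (modular) neural network, with the TD error `delta` back-propagated through the network instead of applied directly to a weight vector.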