Universal parameter optimisation in games based on SPSA

Most game programs have a large number of parameters that are crucial for their performance. While tuning these parameters by hand is rather difficult, efficient and easy-to-use generic automatic parameter-optimisation algorithms are known only for special problems, such as the adjustment of the parameters of an evaluation function. The SPSA (Simultaneous Perturbation Stochastic Approximation) algorithm is a generic stochastic gradient method for optimising an objective function when an analytic expression for the gradient is not available, a frequent case in game programs. Furthermore, SPSA in its canonical form is very easy to implement. As such, it is an attractive choice for parameter optimisation in game programs, due both to its generality and to its simplicity. The goal of this paper is twofold: (i) to introduce SPSA to the game-programming community by putting it into a game-programming perspective, and (ii) to propose and discuss several methods that can be used to enhance its performance. These methods include the use of common random numbers and antithetic variables, a combination of SPSA with RPROP, and the reuse of samples from previous performance evaluations. SPSA with the proposed enhancements was tested in some large-scale experiments on tuning the parameters of an opponent model, a policy and an evaluation function in our poker program, MCRAISE. Whilst SPSA with no enhancements failed to make progress using the allocated resources, SPSA with the enhancements proved to be competitive with other methods, including TD-learning, increasing the average payoff per game by as much as 0.19 times the size of the small bet. From the experimental study, we conclude that the use of an appropriately enhanced variant of SPSA for the optimisation of game-program parameters is a viable approach, especially if no good alternative exists for the types of parameters considered.
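To make the canonical algorithm concrete, the sketch below shows the simultaneous-perturbation gradient estimate together with one of the enhancements mentioned above, common random numbers (both perturbed evaluations share the same simulation seed). This is a minimal illustration, not the paper's implementation: the objective interface f(theta, seed), the toy noisy_payoff function, and the step-size constants are assumptions; the decay exponents 0.602 and 0.101 follow Spall's published guidelines.

```python
import numpy as np

def spsa_maximize(f, theta0, num_iter=1000, a=0.1, c=0.1,
                  alpha=0.602, gamma=0.101, rng=None):
    """Canonical SPSA for maximising a noisy objective f(theta, seed).

    Passing the same seed to both perturbed evaluations implements
    common random numbers, one of the variance-reduction enhancements
    discussed in the paper.
    """
    rng = np.random.default_rng() if rng is None else rng
    theta = np.asarray(theta0, dtype=float).copy()
    for k in range(1, num_iter + 1):
        ak = a / k**alpha                 # decaying step size
        ck = c / k**gamma                 # decaying perturbation size
        # Rademacher perturbation: each component is +1 or -1
        delta = rng.choice([-1.0, 1.0], size=theta.shape)
        seed = int(rng.integers(2**31))   # shared seed -> common random numbers
        y_plus = f(theta + ck * delta, seed)
        y_minus = f(theta - ck * delta, seed)
        # simultaneous-perturbation gradient estimate:
        # two objective evaluations, regardless of dimension
        g_hat = (y_plus - y_minus) / (2.0 * ck * delta)
        theta += ak * g_hat               # gradient ascent on the payoff
    return theta

def noisy_payoff(theta, seed):
    # Hypothetical stand-in for "average payoff over a batch of games":
    # the seed fixes the simulation noise, so the two perturbed
    # evaluations in each SPSA step see identical noise.
    noise = np.random.default_rng(seed).normal(scale=0.5)
    return -np.sum((theta - 1.0) ** 2) + noise

theta_star = spsa_maximize(noisy_payoff, np.zeros(4), num_iter=5000)
```

Note that each iteration needs only two evaluations of the objective, independently of the number of parameters, which is what makes SPSA attractive for game programs. The further enhancements studied in the paper (antithetic variables, the RPROP combination RSPSA, and the reuse of previous evaluation samples) are not shown in this sketch.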
