Combining expert, offline, transient, and online knowledge for Monte-Carlo exploration

We combine machine learning for Monte-Carlo exploration at four different time scales:

- online regret, through the use of bandit algorithms and Monte-Carlo estimates;
- transient learning, through rapid action value estimates (RAVE), which are learnt online, used to accelerate the exploration, and thereafter discarded;
- offline learning, by data mining of datasets of games;
- expert knowledge, accumulated over the years, used as prior information.

The resulting algorithm is stronger than each element taken separately; a minimal sketch of how the sources combine is given below. Finally, we emphasize the exploration-exploitation dilemma in the Monte-Carlo simulations themselves and show the large improvements that can be reached by fine-tuning the related constants.
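To make the combination concrete, here is a minimal Python sketch of a move-selection value blending online Monte-Carlo statistics, transient RAVE statistics, and prior knowledge. The schedule beta = sqrt(k / (3n + k)) follows Gelly and Silver's RAVE formulation; the constants K_RAVE and C_UCB, and the encoding of expert/offline knowledge as virtual wins and visits, are illustrative assumptions, not the paper's tuned values.

```python
import math

K_RAVE = 1000.0   # RAVE equivalence parameter (assumed, not the paper's value)
C_UCB = 0.5       # UCB1 exploration constant (assumed)

class MoveStats:
    """Statistics for one candidate move in a Monte-Carlo tree node."""

    def __init__(self, prior_value=0.5, prior_weight=10.0):
        # Expert/offline knowledge enters as virtual wins and visits,
        # one common way to inject a prior into bandit statistics.
        self.wins = prior_value * prior_weight
        self.visits = prior_weight
        self.rave_wins = prior_value * prior_weight
        self.rave_visits = prior_weight

    def value(self, parent_visits):
        q_mc = self.wins / self.visits              # online MC estimate
        q_rave = self.rave_wins / self.rave_visits  # transient RAVE estimate
        # RAVE weight decays as real visits accumulate, so the transient
        # estimate dominates early and is neglected later.
        beta = math.sqrt(K_RAVE / (3.0 * self.visits + K_RAVE))
        blended = (1.0 - beta) * q_mc + beta * q_rave
        # UCB1-style exploration bonus on top of the blended value.
        return blended + C_UCB * math.sqrt(
            math.log(parent_visits + 1) / self.visits)

if __name__ == "__main__":
    # Toy selection step: pick the move with the highest blended score.
    moves = {name: MoveStats() for name in ("A", "B")}
    moves["A"].wins += 6;  moves["A"].visits += 10
    moves["A"].rave_wins += 40; moves["A"].rave_visits += 60
    moves["B"].wins += 3;  moves["B"].visits += 10
    moves["B"].rave_wins += 20; moves["B"].rave_visits += 60
    best = max(moves, key=lambda m: moves[m].value(parent_visits=20))
    print("selected:", best)
```

The design point is that all four knowledge sources act on the same scalar: the prior initializes the statistics, RAVE dominates while real visits are scarce, and the online estimate with its exploration bonus takes over asymptotically.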
