Combining expert, offline, transient, and online knowledge for Monte-Carlo exploration

We combine machine learning for Monte-Carlo exploration at four different time scales:

- online regret, through the use of bandit algorithms and Monte-Carlo estimates;
- transient learning, through rapid action value estimates (RAVE), which are learnt online, used to accelerate the exploration, and thereafter discarded;
- offline learning, by data mining of datasets of games;
- expert knowledge, accumulated over the years, used as prior information.

The resulting algorithm is stronger than each element taken separately; a minimal sketch of how the sources combine is given below. Finally, we emphasize the exploration-exploitation dilemma in the Monte-Carlo simulations themselves and show the large improvements that can be reached by fine-tuning the related constants.
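To make the combination concrete, here is a minimal Python sketch of a move-selection value blending online Monte-Carlo statistics, transient RAVE statistics, and prior knowledge. The schedule beta = sqrt(k / (3n + k)) follows Gelly and Silver's RAVE formulation; the constants K_RAVE and C_UCB, and the encoding of expert/offline knowledge as virtual wins and visits, are illustrative assumptions, not the paper's tuned values.

```python
import math

K_RAVE = 1000.0   # RAVE equivalence parameter (assumed, not the paper's value)
C_UCB = 0.5       # UCB1 exploration constant (assumed)

class MoveStats:
    """Statistics for one candidate move in a Monte-Carlo tree node."""

    def __init__(self, prior_value=0.5, prior_weight=10.0):
        # Expert/offline knowledge enters as virtual wins and visits,
        # one common way to inject a prior into bandit statistics.
        self.wins = prior_value * prior_weight
        self.visits = prior_weight
        self.rave_wins = prior_value * prior_weight
        self.rave_visits = prior_weight

    def value(self, parent_visits):
        q_mc = self.wins / self.visits              # online MC estimate
        q_rave = self.rave_wins / self.rave_visits  # transient RAVE estimate
        # RAVE weight decays as real visits accumulate, so the transient
        # estimate dominates early and is neglected later.
        beta = math.sqrt(K_RAVE / (3.0 * self.visits + K_RAVE))
        blended = (1.0 - beta) * q_mc + beta * q_rave
        # UCB1-style exploration bonus on top of the blended value.
        return blended + C_UCB * math.sqrt(
            math.log(parent_visits + 1) / self.visits)

if __name__ == "__main__":
    # Toy selection step: pick the move with the highest blended score.
    moves = {name: MoveStats() for name in ("A", "B")}
    moves["A"].wins += 6;  moves["A"].visits += 10
    moves["A"].rave_wins += 40; moves["A"].rave_visits += 60
    moves["B"].wins += 3;  moves["B"].visits += 10
    moves["B"].rave_wins += 20; moves["B"].rave_visits += 60
    best = max(moves, key=lambda m: moves[m].value(parent_visits=20))
    print("selected:", best)
```

The design point is that all four knowledge sources act on the same scalar: the prior initializes the statistics, RAVE dominates while real visits are scarce, and the online estimate with its exploration bonus takes over asymptotically.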
