LinUCB Applied to Monte-Carlo Tree Search

UCT is the de facto standard algorithm in Monte-Carlo tree search (MCTS), which has been applied to various domains with remarkable success. This study proposes a family of LinUCT algorithms that incorporate LinUCB into MCTS. LinUCB is a recently developed bandit method that generalizes across past episodes by ridge regression over feature vectors and rewards, and it outperforms UCB1 in contextual multi-armed bandit problems. We first introduce a straightforward application, \(\text{LinUCT}_{\text{PLAIN}}\), obtained by substituting LinUCB for UCB1 in UCT, and show that it does not work well owing to the minimax structure of game trees. To better handle such tree structures, we present \(\text{LinUCT}_{\text{RAVE}}\) and \(\text{LinUCT}_{\text{FP}}\), which further incorporate two existing techniques: rapid action value estimation (RAVE) and feature propagation, the latter of which recursively propagates the feature vector of a node to that of its parent. Experiments were conducted on a synthetic model that extends the standard incremental random-tree model so that each node carries a feature vector representing the characteristics of the corresponding position. The experimental results indicate that \(\text{LinUCT}_{\text{RAVE}}\), \(\text{LinUCT}_{\text{FP}}\), and their combination \(\text{LinUCT}_{\text{RAVE-FP}}\) outperform UCT, especially when the branching factor is relatively large.
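To make the core mechanism concrete, the following is a minimal sketch of the disjoint LinUCB arm-selection rule that the abstract builds on: each arm keeps a ridge-regression estimate \(\hat{\theta}_a = A_a^{-1} b_a\) from its observed feature vectors and rewards, and the selected arm maximizes the estimated mean plus a confidence bonus. The class and parameter names (`LinUCB`, `alpha`) are illustrative, not from the paper; this is not the paper's LinUCT code.

```python
import numpy as np

class LinUCB:
    """Disjoint LinUCB: one ridge-regression model per arm (illustrative sketch)."""

    def __init__(self, n_arms, dim, alpha=1.0):
        self.alpha = alpha  # exploration strength
        # A_a = I + sum of x x^T (identity acts as the ridge regularizer)
        self.A = [np.eye(dim) for _ in range(n_arms)]
        # b_a = sum of reward * x
        self.b = [np.zeros(dim) for _ in range(n_arms)]

    def select(self, features):
        """features[a]: context vector of arm a; returns the arm with the highest UCB."""
        scores = []
        for A, b, x in zip(self.A, self.b, features):
            A_inv = np.linalg.inv(A)
            theta = A_inv @ b                          # ridge-regression estimate
            bonus = self.alpha * np.sqrt(x @ A_inv @ x)  # confidence width
            scores.append(theta @ x + bonus)           # estimated mean + exploration
        return int(np.argmax(scores))

    def update(self, arm, x, reward):
        """Fold one observed (feature, reward) pair into the chosen arm's model."""
        self.A[arm] += np.outer(x, x)
        self.b[arm] += reward * x
```

In \(\text{LinUCT}_{\text{PLAIN}}\) this selection rule would replace the UCB1 formula at each interior node of the search tree, with the node's feature vector playing the role of the bandit context.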
