LinUCB applied to Monte Carlo tree search

UCT is a standard method in Monte Carlo tree search (MCTS), a family of algorithms that has been applied to various domains with remarkable success. This study proposes a family of LinUCT algorithms that incorporate LinUCB into MCTS. LinUCB is a recently developed bandit method that generalizes over past episodes by ridge regression on feature vectors and rewards, and it outperforms UCB1 in contextual multi-armed bandit problems. We first introduce a straightforward application of LinUCB, LinUCT-PLAIN, obtained by substituting LinUCB for UCB1 in UCT, and show that it does not work well owing to the minimax structure of game trees. To better handle such tree structures, we present LinUCT-RAVE and LinUCT-FP, which further incorporate two existing techniques: rapid action value estimation (RAVE) and feature propagation, which recursively propagates the feature vector of a node to that of its parent. Experiments were conducted with a synthetic model, an extension of the standard incremental random-tree model in which each node has a feature vector representing the characteristics of the corresponding position, and with Finnsson's shock step game, which is used to empirically analyze the performance of UCT with respect to the distribution of suboptimal moves. The experimental results indicate that LinUCT-RAVE and LinUCT-FP outperform UCT, especially when the branching factor is relatively large.
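To make the substitution concrete, the following is a minimal sketch of the disjoint-model LinUCB selection rule that LinUCT-PLAIN uses in place of UCB1 at each tree node. It is not the paper's implementation: the class name `LinUCBArm`, the regularizer `lam`, and the exploration weight `alpha` are illustrative choices, assuming per-arm ridge regression over feature vectors and rewards.

```python
# Sketch of disjoint-model LinUCB (one ridge-regression model per arm/move).
import numpy as np

class LinUCBArm:
    def __init__(self, d, lam=1.0):
        self.A = lam * np.eye(d)   # ridge-regularized Gram matrix
        self.b = np.zeros(d)       # accumulated reward-weighted features

    def score(self, x, alpha):
        A_inv = np.linalg.inv(self.A)
        theta = A_inv @ self.b                  # ridge-regression estimate
        bonus = alpha * np.sqrt(x @ A_inv @ x)  # exploration bonus
        return theta @ x + bonus                # upper confidence bound

    def update(self, x, reward):
        self.A += np.outer(x, x)
        self.b += reward * x

def select_arm(arms, contexts, alpha=1.0):
    # Pick the arm (move) maximizing the LinUCB upper confidence bound,
    # given each arm's feature vector for the current position.
    scores = [arm.score(x, alpha) for arm, x in zip(arms, contexts)]
    return int(np.argmax(scores))
```

In a LinUCT-style search, the feature vector of each child position would play the role of the context, with simulation rewards fed back through `update` along the visited path.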
