On Effective Parallelization of Monte Carlo Tree Search

Despite its groundbreaking success in Go and computer games, Monte Carlo Tree Search (MCTS) is computationally expensive, as it requires a substantial number of rollouts to construct the search tree, which calls for effective parallelization. However, how to design effective parallel MCTS algorithms has not been systematically studied and remains poorly understood. In this paper, we seek to lay the first theoretical foundations for parallel MCTS by examining the potential performance loss caused by parallelization when achieving a desired speedup. In particular, we focus on studying the conditions under which the performance loss (measured by excess regret) vanishes over time. To this end, we propose a general parallel MCTS framework that can be specialized to major existing parallel MCTS algorithms. We derive two necessary conditions for the algorithms covered by the general framework to have vanishing excess regret (i.e., the excess regret converges to zero as the total number of rollouts grows). We demonstrate the effectiveness of these necessary conditions by showing that, for depth-2 search trees, the recently developed WU-UCT algorithm satisfies both conditions and has provably vanishing excess regret. Finally, we perform empirical studies to closely examine the necessary conditions under the general tree search setting (with arbitrary tree depth). The results show that the topological discrepancy between the search trees constructed by the parallel and the sequential MCTS algorithms is the main reason for the performance loss.
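
To make the mechanism concrete, the WU-UCT algorithm mentioned above ([12]) tracks rollouts that have been dispatched to workers but have not yet returned (the "unobserved" samples) and includes them in the visit statistics used during selection, so that concurrent workers are steered away from piling onto the same child. The sketch below illustrates this idea; it is a minimal illustration rather than the paper's implementation, and the node fields, function names, and exploration constant are assumptions made for the example.

```python
import math


class Node:
    """Minimal tree node carrying the statistics a WU-UCT-style selection step needs.

    Field names (completed, ongoing, total_reward) are illustrative, not taken
    from the paper's implementation.
    """

    def __init__(self, children=None):
        self.children = children if children is not None else []
        self.total_reward = 0.0   # sum of returns from completed rollouts through this node
        self.completed = 0        # number of completed rollouts (observed samples)
        self.ongoing = 0          # rollouts dispatched to workers but not yet returned


def select_child(node, c=1.414):
    """Pick a child by a UCT score whose visit counts also include ongoing rollouts.

    Counting the unobserved (ongoing) rollouts inflates the visit count of children
    that many workers are already exploring, pushing newly dispatched workers toward
    other children instead of duplicating work.
    """
    parent_visits = node.completed + node.ongoing
    best_child, best_score = None, -math.inf
    for child in node.children:
        n = child.completed + child.ongoing
        if n == 0:
            return child  # always try an unvisited child first
        # The mean return is computed over completed rollouts only.
        mean = child.total_reward / max(child.completed, 1)
        score = mean + c * math.sqrt(math.log(max(parent_visits, 1)) / n)
        if score > best_score:
            best_child, best_score = child, score
    return best_child
```

In a full parallel loop, one would increment `ongoing` along the selected path when a rollout is dispatched and, once the worker reports back, decrement it while adding the observed return to `total_reward` and `completed`; these bookkeeping details are likewise part of the sketch's assumptions.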

[1] Alec Radford, et al. Proximal Policy Optimization Algorithms, 2017, arXiv.

[2] Mark H. M. Winands, et al. Minimizing Simple and Cumulative Regret in Monte-Carlo Tree Search, 2014, CGW@ECAI.

[3] Aurélien Garivier, et al. On Bayesian Upper Confidence Bounds for Bandit Problems, 2012, AISTATS.

[4] Richard B. Segal, et al. On the Scalability of Parallel UCT, 2010, Computers and Games.

[5] T. Cazenave, et al. On the Parallelization of UCT, 2007.

[6] Rémi Munos, et al. From Bandits to Monte-Carlo Tree Search: The Optimistic Principle Applied to Optimization and Planning, 2014, Found. Trends Mach. Learn.

[7] Carmel Domshlak, et al. Simple Regret Optimization in Online Planning for Markov Decision Processes, 2012, J. Artif. Intell. Res.

[8] H. Jaap van den Herik, et al. Parallel Monte-Carlo Tree Search, 2008, Computers and Games.

[9] S. Shankar Sastry, et al. A Multi-Armed Bandit Approach for Online Expert Selection in Markov Decision Processes, 2017, arXiv.

[10] Demis Hassabis, et al. Mastering the game of Go with deep neural networks and tree search, 2016, Nature.

[11] Honglak Lee, et al. Deep Learning for Real-Time Atari Game Play Using Offline Monte-Carlo Tree Search Planning, 2014, NIPS.

[12] Yu Zhai, et al. Watch the Unobserved: A Simple Approach to Parallelizing Monte Carlo Tree Search, 2020, ICLR.

[13] Jan Willemson, et al. Improved Monte-Carlo Search, 2006.

[14] V. V. Buldygin, et al. Sub-Gaussian random variables, 1980.

[15] P. Cowling, et al. Determinization in Monte-Carlo Tree Search for the card game Dou Di Zhu, 2011.

[16] Carmel Domshlak, et al. On MABs and Separation of Concerns in Monte-Carlo Planning for MDPs, 2014, ICAPS.

[17] Ikuo Takeuchi, et al. Parallel Monte-Carlo Tree Search with Simulation Servers, 2010, International Conference on Technologies and Applications of Artificial Intelligence.

[18] Peter Auer, et al. Finite-time Analysis of the Multiarmed Bandit Problem, 2002, Machine Learning.

[19] Akihiro Kishimoto, et al. Scalable Distributed Monte-Carlo Tree Search, 2011, SOCS.

[20] Osamu Watanabe, et al. Evaluating Root Parallelization in Go, 2010, IEEE Transactions on Computational Intelligence and AI in Games.

[21] David Tolpin, et al. Selecting Computations: Theory and Applications, 2012, UAI.

[22] Thomas Hérault, et al. Scalability and Parallelization of Monte-Carlo Tree Search, 2010, Computers and Games.

[23] Erik Ragnar Poromaa. Crushing Candy Crush: Predicting Human Success Rate in a Mobile Game using Monte-Carlo Tree Search, 2017.

[24] Sylvain Gelly, et al. Exploration exploitation in Go: UCT for Monte-Carlo Go, 2006, NIPS.

[25] Demis Hassabis, et al. Mastering the game of Go without human knowledge, 2017, Nature.

[26] Rémi Munos, et al. Pure exploration in finitely-armed and continuous-armed bandits, 2011, Theor. Comput. Sci.

[27] Eshcar Hillel, et al. Distributed Exploration in Multi-Armed Bandits, 2013, NIPS.

[28] Wouter M. Koolen, et al. Monte-Carlo Tree Search by Best Arm Identification, 2017, NIPS.

[29] Qing Zhao, et al. Distributed Learning in Multi-Armed Bandit With Multiple Players, 2009, IEEE Transactions on Signal Processing.

[30] David Silver, et al. Monte-Carlo tree search and rapid action value estimation in computer Go, 2011, Artif. Intell.

[31] Rémi Munos, et al. Pure Exploration in Multi-armed Bandits Problems, 2009, ALT.

[32] Sam Devlin, et al. Combining Gameplay Data with Monte Carlo Tree Search to Emulate Human Play, 2016, AIIDE.

[33] Yelong Shen, et al. M-Walk: Learning to Walk over Graphs using Monte Carlo Tree Search, 2018, NeurIPS.

[34] Simon M. Lucas, et al. A Survey of Monte Carlo Tree Search Methods, 2012, IEEE Transactions on Computational Intelligence and AI in Games.

[35] Varun Kanade, et al. Decentralized Cooperative Stochastic Bandits, 2018, NeurIPS.

[36] Yoshimasa Tsuruoka, et al. Regulation of exploration for simple regret minimization in Monte-Carlo tree search, 2015, IEEE Conference on Computational Intelligence and Games (CIG).

[37] Peter Auer, et al. Using Confidence Bounds for Exploitation-Exploration Trade-offs, 2003, J. Mach. Learn. Res.

[38] T. L. Lai, Herbert Robbins. Asymptotically Efficient Adaptive Allocation Rules, 1985, Advances in Applied Mathematics.

[39] Nicolas Jouandeau, et al. A Parallel Monte-Carlo Tree Search Algorithm, 2008, Computers and Games.

[40] David Tolpin, et al. MCTS Based on Simple Regret, 2012, AAAI.

[41] Demis Hassabis, et al. Mastering Atari, Go, chess and shogi by planning with a learned model, 2019, Nature.