Scaling Monte Carlo Tree Search on Intel Xeon Phi

Many algorithms have been parallelized successfully on the Intel Xeon Phi coprocessor, especially those with regular, balanced, and predictable data access patterns and instruction flows. Irregular and unbalanced algorithms are harder to parallelize efficiently. Such algorithms occur, for instance, in artificial intelligence search, with Monte Carlo Tree Search (MCTS) as a prominent example. In this paper we study the scaling behavior of MCTS using a highly optimized real-world application on real hardware. The Intel Xeon Phi allows shared-memory scaling studies up to 61 cores and 244 hardware threads. We compare work-stealing (Cilk Plus and TBB) and work-sharing (FIFO scheduling) approaches. Interestingly, we find that a straightforward thread pool with a work-sharing FIFO queue shows the best performance. A crucial element for this high performance is control of the grain size, an approach that we call Grain Size Controlled Parallel MCTS. A subsequent comparison with Xeon CPUs shows an even clearer distinction in performance between the different threading libraries. We achieve, to the best of our knowledge, the fastest implementation of a parallel MCTS on the 61-core (244 hardware threads) Intel Xeon Phi using a real application: 47 times faster than a sequential run.
