Policy Gradient Search: Online Planning and Expert Iteration without Search Trees

Monte Carlo Tree Search (MCTS) algorithms perform simulation-based search to improve policies online. During search, the simulation policy is adapted to explore the most promising lines of play. MCTS has been used by state-of-the-art programs for many problems. However, MCTS estimates the values of states with Monte Carlo averages stored in a search tree, and this does not scale to games with very high branching factors. We propose an alternative simulation-based search method, Policy Gradient Search (PGS), which adapts a neural network simulation policy online via policy gradient updates, avoiding the need for a search tree. In Hex, PGS achieves performance comparable to MCTS, and an agent trained with Expert Iteration using PGS was able to defeat MoHex 2.0, the strongest open-source Hex agent, at 9x9 Hex.
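To make the search loop concrete, here is a minimal sketch of the idea described above. It is an illustration, not the paper's implementation: the toy game (single-pile Nim), the tabular logits standing in for the neural network, and the learning rate and simulation budget are all assumptions chosen to keep the example self-contained and runnable.

```python
import numpy as np

# Toy single-pile Nim: players alternately take 1-3 stones; whoever takes
# the last stone wins. It stands in for Hex only to keep the sketch runnable.
PILE = 10
ACTIONS = (1, 2, 3)

def softmax(x):
    z = np.exp(x - np.max(x))
    return z / z.sum()

class PolicyGradientSearch:
    """Minimal PGS sketch: before each real move, clone the policy
    parameters, fine-tune the clone with REINFORCE updates on simulated
    games from the current state, then act with the adapted clone.
    No search tree or visit statistics are stored."""

    def __init__(self, lr=0.1, n_sims=2000, seed=0):
        # Tabular logits indexed by pile size; a neural network in the paper.
        self.theta = np.zeros((PILE + 1, len(ACTIONS)))
        self.lr, self.n_sims = lr, n_sims
        self.rng = np.random.default_rng(seed)

    def _probs(self, theta, pile):
        # Mask moves that would take more stones than remain.
        legal = [a <= pile for a in ACTIONS]
        logits = np.where(legal, theta[pile], -np.inf)
        return softmax(logits)

    def search(self, pile):
        theta = self.theta.copy()        # online copy, discarded after the move
        for _ in range(self.n_sims):
            trajectory, p, player = [], pile, 0
            while p > 0:                 # one self-play simulation
                probs = self._probs(theta, p)
                a = self.rng.choice(len(ACTIONS), p=probs)
                trajectory.append((p, a, player))
                p -= ACTIONS[a]
                player ^= 1
            winner = player ^ 1          # the mover who emptied the pile
            for s, a, pl in trajectory:  # REINFORCE on the simulation outcome
                z = 1.0 if pl == winner else -1.0
                grad = -self._probs(theta, s)   # grad of log pi wrt logits:
                grad[a] += 1.0                  # one_hot(a) - pi
                theta[s] += self.lr * z * grad
        # Pick the move greedily from the adapted policy (one plausible rule;
        # without a tree there are no visit counts to select by).
        return ACTIONS[int(np.argmax(self._probs(theta, pile)))]

agent = PolicyGradientSearch()
print("PGS move from a pile of 10:", agent.search(PILE))
# Optimal play takes 2, leaving the opponent a multiple of 4.
```

The contrast with MCTS is visible in `search`: information from simulations accumulates in the cloned parameters `theta` rather than in per-node Monte Carlo averages, so memory does not grow with the number of distinct states visited.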

[1] R. J. Williams et al. Simple Statistical Gradient-Following Algorithms for Connectionist Reinforcement Learning, 1992, Machine Learning.

[2] Gerald Tesauro et al. On-line Policy Improvement using Monte-Carlo Search, 1996, NIPS.

[3] Peter Auer et al. Using Confidence Bounds for Exploitation-Exploration Trade-offs, 2003, J. Mach. Learn. Res.

[4] Michail G. Lagoudakis et al. Reinforcement Learning as Classification: Leveraging Modern Classifiers, 2003, ICML.

[5] Richard S. Sutton et al. Reinforcement Learning: An Introduction, 1998, IEEE Trans. Neural Networks.

[6] Rémi Coulom et al. Efficient Selectivity and Backup Operators in Monte-Carlo Tree Search, 2006, Computers and Games.

[7] Csaba Szepesvári et al. Bandit Based Monte-Carlo Planning, 2006, ECML.

[8] David Silver et al. Reinforcement learning and simulation-based search in computer Go, 2009.

[9] Ryan B. Hayward et al. Monte Carlo Tree Search in Hex, 2010, IEEE Transactions on Computational Intelligence and AI in Games.

[10] Richard B. Segal et al. On the Scalability of Parallel UCT, 2010, Computers and Games.

[11] Nataliya Sokolovska et al. Continuous Upper Confidence Trees, 2011, LION.

[12] Christopher D. Rosin et al. Multi-armed bandits with episode context, 2011, Annals of Mathematics and Artificial Intelligence.

[13] Simon M. Lucas et al. A Survey of Monte Carlo Tree Search Methods, 2012, IEEE Transactions on Computational Intelligence and AI in Games.

[14] Jakub Pawlewicz et al. Scalable Parallel DFPN Search, 2013, Computers and Games.

[15] Shih-Chieh Huang et al. MoHex 2.0: A Pattern-Based MCTS Hex Player, 2013, Computers and Games.

[16] Sergey Ioffe et al. Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift, 2015, ICML.

[17] Marco Platzner et al. Adaptive Playouts in Monte-Carlo Tree Search with Policy-Gradient Reinforcement Learning, 2015, ACG.

[18] Jimmy Ba et al. Adam: A Method for Stochastic Optimization, 2014, ICLR.

[19] Matthieu Geist et al. Approximate modified policy iteration and its application to the game of Tetris, 2015, J. Mach. Learn. Res.

[20] Kenny Young et al. Neurohex: A Deep Q-learning Hex Agent, 2016, CGW@IJCAI.

[21] Demis Hassabis et al. Mastering the game of Go with deep neural networks and tree search, 2016, Nature.

[22] Masahito Yamamoto et al. Reinforcement Learning for Creating Evaluation Function Using Convolutional Neural Network in Hex, 2017, Conference on Technologies and Applications of Artificial Intelligence (TAAI).

[23] Demis Hassabis et al. Mastering Chess and Shogi by Self-Play with a General Reinforcement Learning Algorithm, 2017, ArXiv.

[24] David Barber et al. Thinking Fast and Slow with Deep Learning and Tree Search, 2017, NIPS.

[25] Demis Hassabis et al. Mastering the game of Go without human knowledge, 2017, Nature.

[26] Yee Whye Teh et al. Distral: Robust multitask reinforcement learning, 2017, NIPS.

[27] Alec Radford et al. Proximal Policy Optimization Algorithms, 2017, ArXiv.

[28] Chao Gao et al. Adversarial Policy Gradient for Alternating Markov Games, 2018, ICLR.

[29] Martin Müller et al. Move Prediction Using Deep Convolutional Neural Networks in Hex, 2018, IEEE Transactions on Games.

[30] Sergey Levine et al. Divide-and-Conquer Reinforcement Learning, 2017, ICLR.

[31] Michael I. Jordan et al. Ray: A Distributed Framework for Emerging AI Applications, 2017, OSDI.

[32] Demis Hassabis et al. A general reinforcement learning algorithm that masters chess, shogi, and Go through self-play, 2018, Science.