Policy Gradient Search: Online Planning and Expert Iteration without Search Trees

Monte Carlo Tree Search (MCTS) algorithms perform simulation-based search to improve policies online. During search, the simulation policy is adapted to explore the most promising lines of play. MCTS has been used by state-of-the-art programs for many problems. However, MCTS estimates the values of states with Monte Carlo averages stored in a search tree, and this does not scale to games with very high branching factors. We propose an alternative simulation-based search method, Policy Gradient Search (PGS), which adapts a neural network simulation policy online via policy gradient updates, avoiding the need for a search tree. In Hex, PGS achieves performance comparable to MCTS, and an agent trained with Expert Iteration using PGS was able to defeat MoHex 2.0, the strongest open-source Hex agent, at 9x9 Hex.
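To make the search loop concrete, here is a minimal sketch of the idea described above. It is an illustration, not the paper's implementation: the toy game (single-pile Nim), the tabular logits standing in for the neural network, and the learning rate and simulation budget are all assumptions chosen to keep the example self-contained and runnable.

```python
import numpy as np

# Toy single-pile Nim: players alternately take 1-3 stones; whoever takes
# the last stone wins. It stands in for Hex only to keep the sketch runnable.
PILE = 10
ACTIONS = (1, 2, 3)

def softmax(x):
    z = np.exp(x - np.max(x))
    return z / z.sum()

class PolicyGradientSearch:
    """Minimal PGS sketch: before each real move, clone the policy
    parameters, fine-tune the clone with REINFORCE updates on simulated
    games from the current state, then act with the adapted clone.
    No search tree or visit statistics are stored."""

    def __init__(self, lr=0.1, n_sims=2000, seed=0):
        # Tabular logits indexed by pile size; a neural network in the paper.
        self.theta = np.zeros((PILE + 1, len(ACTIONS)))
        self.lr, self.n_sims = lr, n_sims
        self.rng = np.random.default_rng(seed)

    def _probs(self, theta, pile):
        # Mask moves that would take more stones than remain.
        legal = [a <= pile for a in ACTIONS]
        logits = np.where(legal, theta[pile], -np.inf)
        return softmax(logits)

    def search(self, pile):
        theta = self.theta.copy()        # online copy, discarded after the move
        for _ in range(self.n_sims):
            trajectory, p, player = [], pile, 0
            while p > 0:                 # one self-play simulation
                probs = self._probs(theta, p)
                a = self.rng.choice(len(ACTIONS), p=probs)
                trajectory.append((p, a, player))
                p -= ACTIONS[a]
                player ^= 1
            winner = player ^ 1          # the mover who emptied the pile
            for s, a, pl in trajectory:  # REINFORCE on the simulation outcome
                z = 1.0 if pl == winner else -1.0
                grad = -self._probs(theta, s)   # grad of log pi wrt logits:
                grad[a] += 1.0                  # one_hot(a) - pi
                theta[s] += self.lr * z * grad
        # Pick the move greedily from the adapted policy (one plausible rule;
        # without a tree there are no visit counts to select by).
        return ACTIONS[int(np.argmax(self._probs(theta, pile)))]

agent = PolicyGradientSearch()
print("PGS move from a pile of 10:", agent.search(PILE))
# Optimal play takes 2, leaving the opponent a multiple of 4.
```

The contrast with MCTS is visible in `search`: information from simulations accumulates in the cloned parameters `theta` rather than in per-node Monte Carlo averages, so memory does not grow with the number of distinct states visited.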

[1] R. J. Williams et al. Simple Statistical Gradient-Following Algorithms for Connectionist Reinforcement Learning, 1992, Machine Learning.

[2] Gerald Tesauro et al. On-line Policy Improvement using Monte-Carlo Search, 1996, NIPS.

[3] Peter Auer et al. Using Confidence Bounds for Exploitation-Exploration Trade-offs, 2003, J. Mach. Learn. Res.

[4] Michail G. Lagoudakis et al. Reinforcement Learning as Classification: Leveraging Modern Classifiers, 2003, ICML.

[5] Richard S. Sutton et al. Reinforcement Learning: An Introduction, 1998, IEEE Trans. Neural Networks.

[6] Rémi Coulom et al. Efficient Selectivity and Backup Operators in Monte-Carlo Tree Search, 2006, Computers and Games.

[7] Csaba Szepesvári et al. Bandit Based Monte-Carlo Planning, 2006, ECML.

[8] David Silver et al. Reinforcement learning and simulation-based search in computer Go, 2009.

[9] Ryan B. Hayward et al. Monte Carlo Tree Search in Hex, 2010, IEEE Transactions on Computational Intelligence and AI in Games.

[10] Richard B. Segal et al. On the Scalability of Parallel UCT, 2010, Computers and Games.

[11] Nataliya Sokolovska et al. Continuous Upper Confidence Trees, 2011, LION.

[12] Christopher D. Rosin et al. Multi-armed bandits with episode context, 2011, Annals of Mathematics and Artificial Intelligence.

[13] Simon M. Lucas et al. A Survey of Monte Carlo Tree Search Methods, 2012, IEEE Transactions on Computational Intelligence and AI in Games.

[14] Jakub Pawlewicz et al. Scalable Parallel DFPN Search, 2013, Computers and Games.

[15] Shih-Chieh Huang et al. MoHex 2.0: A Pattern-Based MCTS Hex Player, 2013, Computers and Games.

[16] Sergey Ioffe et al. Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift, 2015, ICML.

[17] Marco Platzner et al. Adaptive Playouts in Monte-Carlo Tree Search with Policy-Gradient Reinforcement Learning, 2015, ACG.

[18] Jimmy Ba et al. Adam: A Method for Stochastic Optimization, 2014, ICLR.

[19] Matthieu Geist et al. Approximate modified policy iteration and its application to the game of Tetris, 2015, J. Mach. Learn. Res.

[20] Kenny Young et al. Neurohex: A Deep Q-learning Hex Agent, 2016, CGW@IJCAI.

[21] Demis Hassabis et al. Mastering the game of Go with deep neural networks and tree search, 2016, Nature.

[22] Masahito Yamamoto et al. Reinforcement Learning for Creating Evaluation Function Using Convolutional Neural Network in Hex, 2017, Conference on Technologies and Applications of Artificial Intelligence (TAAI).

[23] Demis Hassabis et al. Mastering Chess and Shogi by Self-Play with a General Reinforcement Learning Algorithm, 2017, ArXiv.

[24] David Barber et al. Thinking Fast and Slow with Deep Learning and Tree Search, 2017, NIPS.

[25] Demis Hassabis et al. Mastering the game of Go without human knowledge, 2017, Nature.

[26] Yee Whye Teh et al. Distral: Robust multitask reinforcement learning, 2017, NIPS.

[27] Alec Radford et al. Proximal Policy Optimization Algorithms, 2017, ArXiv.

[28] Chao Gao et al. Adversarial Policy Gradient for Alternating Markov Games, 2018, ICLR.

[29] Martin Müller et al. Move Prediction Using Deep Convolutional Neural Networks in Hex, 2018, IEEE Transactions on Games.

[30] Sergey Levine et al. Divide-and-Conquer Reinforcement Learning, 2017, ICLR.

[31] Michael I. Jordan et al. Ray: A Distributed Framework for Emerging AI Applications, 2017, OSDI.

[32] Demis Hassabis et al. A general reinforcement learning algorithm that masters chess, shogi, and Go through self-play, 2018, Science.