Monte-Carlo Tree Search as Regularized Policy Optimization

The combination of Monte-Carlo tree search (MCTS) with deep reinforcement learning has led to significant advances in artificial intelligence. However, AlphaZero, the current state-of-the-art MCTS algorithm, still relies on handcrafted heuristics that are only partially understood. In this paper, we show that AlphaZero's search heuristics, along with other common ones such as UCT, are approximations to the solution of a specific regularized policy optimization problem. With this insight, we propose a variant of AlphaZero that uses the exact solution to this policy optimization problem, and we show experimentally that it reliably outperforms the original algorithm in multiple domains.
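To make the claim concrete: the regularized objective in question is π̄ = argmax over the simplex of qᵀπ − λ·KL(π_θ ‖ π), where q holds the search Q-values, π_θ is the network's prior policy, and λ is a multiplier that in the paper depends on the visit counts and the exploration constant. The maximizer has the closed form π̄(a) = λ·π_θ(a) / (α − q(a)), with the scalar normalizer α found by bisection. Below is a minimal NumPy sketch of that computation; the function and variable names (`exact_regularized_policy`, `q`, `prior`, `lam`) are illustrative, and λ is treated as a given constant rather than derived from the search statistics.

```python
import numpy as np

def exact_regularized_policy(q, prior, lam, tol=1e-8):
    """Solve pi_bar = argmax_{pi in simplex} q @ pi - lam * KL(prior || pi).

    The maximizer has the form pi_bar(a) = lam * prior(a) / (alpha - q(a)),
    where alpha is a scalar chosen so that pi_bar sums to one.

    Assumes q and prior are equal-length 1-D arrays, prior is a strictly
    positive probability vector, and lam > 0.
    """
    # alpha is bracketed by
    #   max_a (q(a) + lam * prior(a))  <=  alpha  <=  max_a q(a) + lam,
    # and the total mass sum_a pi_bar(a) is monotonically decreasing in
    # alpha, so plain bisection pins it down.
    lo = np.max(q + lam * prior)
    hi = np.max(q) + lam
    while hi - lo > tol:
        alpha = 0.5 * (lo + hi)
        mass = np.sum(lam * prior / (alpha - q))
        if mass > 1.0:
            lo = alpha  # mass too large: raise alpha
        else:
            hi = alpha  # mass too small: lower alpha
    pi_bar = lam * prior / (0.5 * (lo + hi) - q)
    return pi_bar / pi_bar.sum()  # absorb residual bisection error
```

A search variant along the lines the abstract describes would then act (and propagate targets) according to π̄ directly, instead of the visit-count-based PUCT rule that only approximates it; the abstract's claim is that this exact substitution is what reliably outperforms the original algorithm.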
