Maximum Entropy Monte-Carlo Planning

We develop a new algorithm for online planning in large-scale sequential decision problems that improves upon the worst-case efficiency of UCT. The idea is to augment Monte-Carlo Tree Search (MCTS) with maximum entropy policy optimization, evaluating each search node by softmax values back-propagated from simulations. To establish the effectiveness of this approach, we first investigate the single-step decision problem, stochastic softmax bandits, and show that softmax values can be estimated at an optimal convergence rate in terms of mean squared error. We then extend this approach to general sequential decision making by developing a general MCTS algorithm, Maximum Entropy for Tree Search (MENTS). We prove that the probability of MENTS failing to identify the best decision at the root decays exponentially, a fundamental improvement over the polynomial convergence rate of UCT. Our experiments also show that MENTS is more sample efficient than UCT on both synthetic problems and Atari 2600 games.
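To make the softmax backup concrete, below is a minimal, illustrative Python sketch of the single-step setting described above, a stochastic softmax bandit: arms are sampled from a softmax over the current value estimates mixed with a decaying uniform distribution, and the node value is the log-sum-exp (softmax) of the empirical action values. The temperature `TAU`, exploration constant `EPSILON`, and the exact mixing schedule here are illustrative assumptions, not the paper's precise specification.

```python
import math
import random

TAU = 0.1      # softmax temperature (hypothetical value, for illustration)
EPSILON = 0.5  # exploration constant (hypothetical value, for illustration)

def softmax_value(q_values, tau=TAU):
    """Soft (log-sum-exp) value of a set of action values:
    V = tau * log(sum_a exp(Q(a) / tau))."""
    m = max(q_values)  # subtract the max for numerical stability
    return m + tau * math.log(sum(math.exp((q - m) / tau) for q in q_values))

def sampling_policy(q_values, t, tau=TAU, eps=EPSILON):
    """Mix a softmax over the current value estimates with a uniform
    distribution whose weight decays like O(1/log t), so every arm keeps
    being sampled. (Illustrative sketch in the spirit of the paper's
    sampling strategy, not a verbatim reproduction.)"""
    k = len(q_values)
    lam = min(1.0, eps * k / math.log(t + 2))
    m = max(q_values)
    exps = [math.exp((q - m) / tau) for q in q_values]
    z = sum(exps)
    return [(1 - lam) * e / z + lam / k for e in exps]

def run_softmax_bandit(reward_fns, n_rounds=10_000):
    """Estimate the softmax value of a stochastic bandit by drawing arms
    from the mixed policy and maintaining running-mean value estimates."""
    k = len(reward_fns)
    counts = [0] * k
    q_hat = [0.0] * k
    for t in range(n_rounds):
        probs = sampling_policy(q_hat, t)
        a = random.choices(range(k), weights=probs)[0]
        r = reward_fns[a]()
        counts[a] += 1
        q_hat[a] += (r - q_hat[a]) / counts[a]  # incremental mean update
    return softmax_value(q_hat), q_hat

if __name__ == "__main__":
    # Three Gaussian arms with means 0.2, 0.5, 0.9 (toy example).
    arms = [lambda m=m: random.gauss(m, 1.0) for m in (0.2, 0.5, 0.9)]
    v_soft, q = run_softmax_bandit(arms)
    print(f"softmax value estimate: {v_soft:.3f}, Q estimates: {q}")
```

In the tree-search extension, the same log-sum-exp value computed at each node is what gets back-propagated to its parent in place of UCT's Monte-Carlo average.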
