Scalable Online Planning via Reinforcement Learning Fine-Tuning

Lookahead search has been a critical component of recent AI successes, such as in the games of chess, Go, and poker. However, the search methods used in these games, and in many other settings, are tabular. Tabular search methods do not scale well with the size of the search space, and this problem is exacerbated by stochasticity and partial observability. In this work, we replace tabular search with online model-based fine-tuning of a policy neural network via reinforcement learning, and show that this approach outperforms state-of-the-art search algorithms in benchmark settings. In particular, we use our search algorithm to achieve a new state-of-the-art result in self-play Hanabi, and demonstrate the generality of our approach by showing that it also outperforms tabular search in the Atari game Ms. Pac-Man.
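To make the core idea concrete, the sketch below illustrates decision-time policy fine-tuning in Python: at each decision point, a copy of the policy network is improved with a few reinforcement-learning updates on rollouts simulated from the current state, and the improved copy selects the action. This is a minimal illustration under stated assumptions, not the paper's exact algorithm: the `PolicyNet` architecture, the REINFORCE-style update, and the `simulator` interface (`reset_to` / `step`) are all hypothetical stand-ins introduced here for concreteness.

```python
# Minimal sketch (not the paper's algorithm) of online policy fine-tuning:
# clone the current policy, improve the clone with a few policy-gradient
# updates on rollouts simulated from the current state, then act with it.
import copy
import torch
import torch.nn as nn
from torch.distributions import Categorical


class PolicyNet(nn.Module):
    """Small categorical policy over discrete actions (illustrative only)."""

    def __init__(self, obs_dim: int, n_actions: int, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.ReLU(), nn.Linear(hidden, n_actions)
        )

    def forward(self, obs: torch.Tensor) -> Categorical:
        return Categorical(logits=self.net(obs))


def fine_tune_and_act(policy, simulator, obs, n_updates=10, n_rollouts=8,
                      horizon=20, gamma=0.99, lr=1e-3):
    """Fine-tune a copy of `policy` on rollouts from `obs`, then pick an action."""
    local = copy.deepcopy(policy)                  # leave the base policy untouched
    opt = torch.optim.Adam(local.parameters(), lr=lr)

    for _ in range(n_updates):
        losses = []
        for _ in range(n_rollouts):
            o = simulator.reset_to(obs)            # hypothetical: start rollout at current state
            log_probs, rewards = [], []
            for _ in range(horizon):
                dist = local(torch.as_tensor(o, dtype=torch.float32))
                a = dist.sample()
                log_probs.append(dist.log_prob(a))
                o, r, done = simulator.step(a.item())  # hypothetical model interface
                rewards.append(r)
                if done:
                    break
            # Discounted return-to-go at each step (REINFORCE target).
            returns, g = [], 0.0
            for r in reversed(rewards):
                g = r + gamma * g
                returns.append(g)
            returns.reverse()
            returns = torch.tensor(returns)
            losses.append(-(torch.stack(log_probs) * returns).sum())

        opt.zero_grad()
        torch.stack(losses).mean().backward()
        opt.step()

    with torch.no_grad():
        return local(torch.as_tensor(obs, dtype=torch.float32)).sample().item()
```

The fine-tuned copy is local to the current decision point, so the cost of the updates is paid online at every step; the base policy serves only as the starting point for each round of fine-tuning, consistent with the abstract's description of online fine-tuning rather than tabular search.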
