Learned Belief Search: Efficiently Improving Policies in Partially Observable Settings

Search is an important tool for computing effective policies in single- and multi-agent environments, and has been crucial for achieving superhuman performance in several benchmark games, both fully and partially observable. However, one major limitation of prior search approaches for partially observable environments is that the computational cost scales poorly with the amount of hidden information. In this paper we present Learned Belief Search (LBS), a computationally efficient search procedure for partially observable environments. Rather than maintaining an exact belief distribution, LBS uses an approximate auto-regressive counterfactual belief that is learned as a supervised task. In multi-agent settings, LBS uses a novel public-private model architecture for the underlying policies in order to efficiently evaluate these policies during rollouts. In the benchmark domain of Hanabi, LBS obtains 55% ∼ 91% of the benefit of exact search while reducing compute requirements by 35.8× ∼ 4.6×, allowing it to scale to larger settings that were inaccessible to previous search methods.
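
To make the procedure concrete, below is a minimal, hedged sketch (not the authors' implementation) of the two ingredients the abstract describes: an auto-regressive belief model, trained with supervised learning, that predicts the hidden hand one card at a time, and a search step that samples hands from that model and scores each legal action by averaging blueprint-policy rollouts. All names here (`AutoRegressiveBelief`, `lbs_act`, the `simulate` hook) are illustrative assumptions, not APIs from the paper.

```python
# Hypothetical sketch of Learned Belief Search (LBS); names, shapes and the
# simulator hook are illustrative assumptions, not the paper's architecture.
import torch
import torch.nn as nn
import torch.nn.functional as F


class AutoRegressiveBelief(nn.Module):
    """Toy auto-regressive belief: predicts the acting player's hidden hand
    one card at a time, each slot conditioned on the observation history and
    the cards already sampled for earlier slots."""

    def __init__(self, obs_dim, num_card_types, hand_size, hidden_dim=128):
        super().__init__()
        self.num_card_types = num_card_types
        self.encoder = nn.GRU(obs_dim, hidden_dim, batch_first=True)
        # Slot i additionally sees the i previously sampled cards (one-hot).
        self.heads = nn.ModuleList(
            nn.Linear(hidden_dim + i * num_card_types, num_card_types)
            for i in range(hand_size)
        )

    def sample_hand(self, obs_seq):
        """obs_seq: (batch, time, obs_dim) -> sampled hand (batch, hand_size)."""
        _, h = self.encoder(obs_seq)
        context = h[-1]  # (batch, hidden_dim)
        sampled = []
        for head in self.heads:
            if sampled:
                prev = torch.cat(
                    [F.one_hot(c, self.num_card_types).float() for c in sampled],
                    dim=-1)
            else:
                prev = context.new_zeros(context.size(0), 0)
            logits = head(torch.cat([context, prev], dim=-1))
            sampled.append(torch.distributions.Categorical(logits=logits).sample())
        return torch.stack(sampled, dim=-1)


def lbs_act(belief_model, obs_seq, legal_actions, simulate, n_samples=50):
    """One LBS decision: for each legal action, average rollout returns over
    hands sampled from the learned belief, then play the argmax.
    `simulate(hand, action)` is a caller-supplied (hypothetical) hook that
    resets a simulator to a state consistent with `hand`, applies `action`,
    and rolls out the blueprint policy to the end of the game."""
    best_action, best_value = None, float("-inf")
    for action in legal_actions:
        value = sum(
            simulate(belief_model.sample_hand(obs_seq), action)
            for _ in range(n_samples)
        ) / n_samples
        if value > best_value:
            best_action, best_value = action, value
    return best_action
```

Under these assumptions, the cost of a decision is roughly `len(legal_actions) * n_samples` rollouts plus the (cheap) auto-regressive sampling, which is what lets the approach avoid enumerating an exact belief over all consistent hands.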
