Bayesian Action Decoder for Deep Multi-Agent Reinforcement Learning

When observing the actions of others, humans make inferences about why they acted as they did, and what this implies about the world; humans also use the fact that their actions will be interpreted in this manner, allowing them to act informatively and thereby communicate efficiently with others. Although learning algorithms have recently achieved superhuman performance in a number of two-player, zero-sum games, scalable multi-agent reinforcement learning algorithms that can discover effective strategies and conventions in complex, partially observable settings have proven elusive. We present the Bayesian action decoder (BAD), a new multi-agent learning method that uses an approximate Bayesian update to obtain a public belief that conditions on the actions taken by all agents in the environment. BAD introduces a new Markov decision process, the public belief MDP, in which the action space consists of all deterministic partial policies, and exploits the fact that an agent acting only on this public belief state can still learn to use its private information if the action space is augmented to be over all partial policies mapping private information into environment actions. The Bayesian update is closely related to the theory of mind reasoning that humans carry out when observing others' actions. We first validate BAD on a proof-of-principle two-step matrix game, where it outperforms policy gradient methods; we then evaluate BAD on the challenging, cooperative partial-information card game Hanabi, where, in the two-player setting, it surpasses all previously published learning and hand-coded approaches, establishing a new state of the art.
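The mechanism the abstract describes, treating an observed action as evidence about the actor's private information, can be illustrated with a small sketch. Below is a minimal, illustrative Python example of an exact Bayesian public-belief update under a deterministic partial policy; the function name, the flat array representation of beliefs and policies, and the toy numbers are assumptions made for exposition, not the paper's actual implementation (which uses an approximate update and neural-network-parameterised policies).

```python
import numpy as np

def public_belief_update(belief, partial_policy, observed_action):
    """One step of an exact Bayesian public-belief update (illustrative sketch).

    belief:          shape (n_private_states,), the shared prior P(f) over the
                     acting agent's private features f.
    partial_policy:  shape (n_private_states,), a deterministic partial policy
                     mapping each private state f to an environment action.
    observed_action: the environment action all agents saw being taken.

    Returns the posterior P(f | a) ∝ P(a | f) P(f); for a deterministic
    partial policy, the likelihood P(a | f) is the indicator of policy(f) == a.
    """
    likelihood = (partial_policy == observed_action).astype(float)
    posterior = likelihood * belief
    total = posterior.sum()
    if total == 0.0:
        # Observed action is inconsistent with the policy/belief; keep the prior.
        return belief
    return posterior / total


# Toy usage: three possible private states; the policy maps them to actions 0, 1, 1.
prior = np.array([0.5, 0.3, 0.2])
policy = np.array([0, 1, 1])
print(public_belief_update(prior, policy, observed_action=1))  # -> [0.  0.6 0.4]
```

In BAD itself, the belief is updated approximately rather than exactly, and the deterministic partial policies are themselves selected by an agent that conditions only on the public belief state, which is what allows the chosen actions to be decoded as informative signals by all observers.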
