Pipeline PSRO: A Scalable Approach for Finding Approximate Nash Equilibria in Large Games

Finding approximate Nash equilibria in zero-sum imperfect-information games is challenging when the number of information states is large. Policy Space Response Oracles (PSRO) is a deep reinforcement learning algorithm grounded in game theory that is guaranteed to converge to an approximate Nash equilibrium. However, PSRO requires training a reinforcement learning policy at each iteration, making it too slow for large games. We show through counterexamples and experiments that DCH and Rectified PSRO, two existing approaches to scaling up PSRO, fail to converge even in small games. We introduce Pipeline PSRO (P2SRO), the first scalable general method for finding approximate Nash equilibria in large zero-sum imperfect-information games. P2SRO is able to parallelize PSRO with convergence guarantees by maintaining a hierarchical pipeline of reinforcement learning workers, each training against the policies generated by lower levels in the hierarchy. We show that unlike existing methods, P2SRO converges to an approximate Nash equilibrium, and does so faster as the number of parallel workers increases, across a variety of imperfect-information games. We also introduce an open-source environment for Barrage Stratego, a variant of Stratego with an approximate game tree complexity of $10^{50}$. P2SRO is able to achieve state-of-the-art performance on Barrage Stratego and beats all existing bots.
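The pipeline mechanism described in the abstract can be sketched in code. The following is a minimal illustrative sketch, not the authors' implementation: the callables `make_new_policy`, `train_step`, and `solve_restricted_game` are hypothetical placeholders, and details such as when a worker is considered finished are simplified to one graduation per outer iteration.

```python
from typing import Callable, List

# Assumed interface: a policy maps an observation to an action.
Policy = Callable[[object], int]


def p2sro_sketch(
    make_new_policy: Callable[[], Policy],
    train_step: Callable[[Policy, List[Policy], List[float]], Policy],
    solve_restricted_game: Callable[[List[Policy]], List[float]],
    num_workers: int = 3,
    num_iterations: int = 10,
) -> List[Policy]:
    """Hierarchical pipeline of best-response workers, ordered from lowest
    (oldest, closest to finishing) to highest (newest) level."""
    fixed_population: List[Policy] = [make_new_policy()]
    active_workers: List[Policy] = [make_new_policy() for _ in range(num_workers)]

    for _ in range(num_iterations):
        # Meta-strategy over the fixed population (e.g. from a matrix-game solver).
        meta_strategy = solve_restricted_game(fixed_population)
        for level in range(len(active_workers)):
            # Each worker trains against the fixed policies plus the
            # still-training policies at lower levels of the pipeline.
            opponents = fixed_population + active_workers[:level]
            active_workers[level] = train_step(
                active_workers[level], opponents, meta_strategy
            )
        # The lowest-level worker graduates into the fixed population and a
        # fresh worker enters at the top, keeping the pipeline full.
        fixed_population.append(active_workers.pop(0))
        active_workers.append(make_new_policy())

    return fixed_population
```

The key design point this sketch tries to capture is that all workers train in parallel, while each worker only ever treats policies below it as (eventually) fixed, which is what preserves PSRO-style convergence while removing the strictly sequential bottleneck.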
