XDO: A Double Oracle Algorithm for Extensive-Form Games

Policy Space Response Oracles (PSRO) is a deep reinforcement learning algorithm for two-player zero-sum games that has empirically found approximate Nash equilibria in large games. Although PSRO is guaranteed to converge to a Nash equilibrium, it may require a number of iterations that grows exponentially with the number of infostates. We propose Extensive-Form Double Oracle (XDO), an extensive-form double oracle algorithm that is guaranteed to converge to an approximate Nash equilibrium in a number of iterations that is linear in the number of infostates. Unlike PSRO, which mixes best responses at the root of the game, XDO mixes best responses at every infostate. We also introduce Neural XDO (NXDO), in which best responses are learned through deep reinforcement learning. In tabular experiments on Leduc poker, we find that XDO reaches an approximate Nash equilibrium in one to two orders of magnitude fewer iterations than PSRO. In experiments on a modified Leduc poker game, we show that tabular XDO achieves over 11x lower exploitability than counterfactual regret minimization (CFR) and over 82x lower exploitability than PSRO and extensive-form fictitious play (XFP) in the same amount of time. We also show that NXDO outperforms PSRO and is competitive with Neural Fictitious Self-Play (NFSP) on a large no-limit poker game.
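The core difference stated above is where mixing happens: PSRO solves a meta-game that mixes over whole population policies at the root, whereas XDO solves a restricted game in which play can switch between population best responses at every infostate. The following minimal Python sketch of the outer double-oracle loop is included only to illustrate that idea under stated assumptions; the helper names `solve_restricted_game`, `best_response`, and `exploitability` are hypothetical placeholders, not the authors' implementation or API.

```python
def xdo(game, epsilon=0.01, max_iters=1000):
    """Illustrative sketch of the XDO outer loop described in the abstract."""
    # Start each player's population with an arbitrary policy (assumption:
    # a uniform-random policy is available from the game object).
    population = [game.uniform_random_policy()]

    meta_nash = None
    for _ in range(max_iters):
        # Restricted game: at every infostate the acting player chooses which
        # population strategy to follow, so the resulting equilibrium mixes
        # best responses per infostate (unlike PSRO's root-level mixing).
        # A tabular solver such as CFR could be used here.
        meta_nash = solve_restricted_game(game, population)

        # Best responses to the restricted-game equilibrium; in NXDO these
        # would instead be trained with deep reinforcement learning.
        br_0 = best_response(game, meta_nash, player=0)
        br_1 = best_response(game, meta_nash, player=1)

        # Stop once the restricted-game equilibrium is approximately
        # unexploitable in the full game.
        if exploitability(game, meta_nash) <= epsilon:
            break

        # Otherwise, grow the population and re-solve the restricted game.
        population += [br_0, br_1]

    return meta_nash
```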
