Paper Information - Anytime Optimal PSRO for Two-Player Zero-Sum Games

Policy Space Response Oracles (PSRO) is a multi-agent reinforcement learning algorithm for games that can handle continuous actions and has empirically found approximate Nash equilibria in large games. PSRO is based on the tabular Double Oracle (DO) method, an algorithm that is guaranteed to converge to a Nash equilibrium, but may increase exploitability from one iteration to the next. We propose Anytime Optimal Double Oracle (AODO), a tabular double oracle algorithm for 2-player zero-sum games that is guaranteed to converge to a Nash equilibrium while decreasing exploitability from iteration to iteration. Unlike DO, in which the meta-strategy is based on the restricted game formed by each player’s strategy sets, AODO finds the meta-strategy for each player that minimizes its exploitability against any policy in the full, unrestricted game. We also propose a method of finding this meta-strategy via a no-regret algorithm updated against a continually-trained best response, called RMBR DO. Finally, we propose Anytime Optimal PSRO, a version of AODO that calculates best responses via reinforcement learning. In experiments on Leduc poker and random normal form games, we show that our methods achieve far lower exploitability than DO and PSRO and never increase exploitability.
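The idea of updating a no-regret learner against a best response can be illustrated on a small matrix game. The following is a minimal sketch under stated assumptions, not the authors' implementation: the names `rm_br_row` and `worst_case_value` are hypothetical, the no-regret algorithm is assumed to be plain regret matching, the best response is computed exactly rather than continually trained, and rock-paper-scissors is used only as an illustrative game (it is not one of the paper's benchmarks). The row player mixes over a restricted set of rows, while the column player best-responds over all columns of the full game, so the averaged strategy approximately maximizes the row player's worst-case payoff among mixtures of the restricted rows.

```python
def worst_case_value(A, x):
    """Row player's payoff when the column player best-responds to x in the full game."""
    n_cols = len(A[0])
    return min(sum(x[i] * A[i][j] for i in range(len(A))) for j in range(n_cols))


def rm_br_row(A, rows, iters=2000):
    """Regret matching over a restricted set of rows against a full-game best response.

    A is the row player's payoff matrix (the row player maximizes).
    rows lists the row indices in the row player's restricted population.
    Returns the averaged mixed strategy, expressed over all rows of A.
    """
    rows = list(rows)
    k, n_cols = len(rows), len(A[0])
    regrets = [0.0] * k
    avg = [0.0] * k
    for _ in range(iters):
        # Regret matching: play proportionally to positive cumulative regret.
        pos = [max(r, 0.0) for r in regrets]
        total = sum(pos)
        sigma = [p / total for p in pos] if total > 0 else [1.0 / k] * k
        avg = [a + s for a, s in zip(avg, sigma)]
        # Column player best-responds to sigma over ALL columns (unrestricted game).
        col_vals = [sum(sigma[i] * A[rows[i]][j] for i in range(k)) for j in range(n_cols)]
        br = min(range(n_cols), key=col_vals.__getitem__)
        # Accumulate each restricted row's regret against that best response.
        u = [A[r][br] for r in rows]
        expected = sum(s * ui for s, ui in zip(sigma, u))
        regrets = [r + ui - expected for r, ui in zip(regrets, u)]
    x = [0.0] * len(A)
    for i, r in enumerate(rows):
        x[r] = avg[i] / iters
    return x


# Illustrative game only: rock-paper-scissors payoffs for the row player.
RPS = [[0, -1, 1], [1, 0, -1], [-1, 1, 0]]
```

With all three rows available, `worst_case_value(RPS, rm_br_row(RPS, [0, 1, 2]))` should approach the game value of 0. With the population restricted to rock and paper (`rows=[0, 1]`), it should approach -1/3, the best worst-case payoff that any mixture of those two rows can guarantee against the full column set.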
