Finding and Certifying (Near-)Optimal Strategies in Black-Box Extensive-Form Games

Often, as in war games, strategy video games, and financial simulations, the game is given to us only as a black-box simulator that we can play through. In these settings, the game may have unknown nature action distributions (from which we can only obtain samples) and/or be too large to expand fully, so it can be difficult to compute strategies with guarantees on exploitability. Recent work [17] introduced a notion of certificate for extensive-form games that yields exploitability guarantees without expanding the full game tree. However, that work assumed that the black box can sample or expand arbitrary nodes of the game tree at any time, and that a series of exact game solves (via, for example, linear programming) can be conducted to compute the certificate. Each of these two assumptions severely restricts the practical applicability of that method. In this work, we relax both assumptions. We show that high-probability certificates can be obtained with a black box that can do nothing more than play through games, using only a regret minimizer as a subroutine. As a bonus, we obtain an equilibrium-finding algorithm with an $\tilde O(\sqrt{T})$ regret bound in the extensive-form game setting that does not rely on a sampling strategy with lower-bounded reach probabilities (an assumption MCCFR requires). We demonstrate experimentally that, in the black-box setting, our methods provide nontrivial exploitability guarantees while expanding only a small fraction of the game tree.
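
The abstract does not spell out the algorithms, but the high-level recipe it describes (run a regret minimizer using only playthroughs of a black-box simulator, then attach a high-probability slack that turns empirical play into an exploitability certificate) can be illustrated on a toy game. Below is a minimal sketch, not the paper's method: the game is a noisy 2x2 zero-sum matrix game rather than an extensive-form game, and `MEAN`, `SIGMA`, `T`, `DELTA`, the estimated-payoff regret updates, and the Hoeffding-style slack are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
MEAN = np.array([[1.0, -1.0], [-1.0, 1.0]])   # row player's true mean payoffs (matching pennies)
SIGMA, T, DELTA = 0.1, 50_000, 0.05           # simulator noise, playthroughs, failure probability

def play(i: int, j: int) -> float:
    """Black-box simulator: a single playthrough returns one noisy payoff sample."""
    return MEAN[i, j] + rng.normal(0.0, SIGMA)

def regret_matching(r: np.ndarray) -> np.ndarray:
    """Regret minimizer: play proportionally to positive cumulative regret."""
    p = np.maximum(r, 0.0)
    return p / p.sum() if p.sum() > 0 else np.full_like(r, 1.0 / len(r))

r_row, r_col = np.zeros(2), np.zeros(2)        # cumulative regrets
x_bar, y_bar = np.zeros(2), np.zeros(2)        # running average strategies
tot, cnt = np.zeros((2, 2)), np.zeros((2, 2))  # payoff sample sums and counts

for _ in range(T):
    x, y = regret_matching(r_row), regret_matching(r_col)
    i, j = rng.choice(2, p=x), rng.choice(2, p=y)
    u = play(i, j)                             # the only access to the game
    tot[i, j] += u
    cnt[i, j] += 1
    u_hat = tot / np.maximum(cnt, 1)           # running estimates of mean payoffs
    # Regret updates reuse estimated payoffs for the actions not played; this
    # is an assumption of the sketch, not the update rule from the paper.
    r_row += u_hat[:, j] - u_hat[i, j]         # row player maximizes u
    r_col += u_hat[i, j] - u_hat[i, :]         # column player minimizes u
    x_bar += x
    y_bar += y

x_bar /= T
y_bar /= T
# Duality gap of the average profile under estimated payoffs, plus a crude
# Gaussian-tail slack (union bound over the 4 matrix entries) so that the
# exploitability certificate holds with probability at least 1 - DELTA.
gap = (u_hat @ y_bar).max() - (x_bar @ u_hat).min()
slack = SIGMA * np.sqrt(2.0 * np.log(2 * 4 / DELTA) / np.maximum(cnt, 1)).max()
print(f"certified exploitability <= {gap + 2 * slack:.3f} with probability >= {1 - DELTA}")
```

The certified quantity is the duality gap of the average strategy profile under the estimated payoffs plus a concentration term; the paper's extensive-form setting additionally has to handle sequential decisions and unexpanded subtrees, which this sketch sidesteps.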

[1] Martin Zinkevich et al. Online Convex Programming and Generalized Infinitesimal Gradient Ascent. ICML, 2003.

[2] Tuomas Sandholm et al. A Competitive Texas Hold'em Poker Player via Automated Abstraction and Real-Time Equilibrium Computation. AAAI, 2006.

[3] Tuomas Sandholm et al. A Unified Framework for Extensive-Form Game Abstraction with Bounds. NeurIPS, 2018.

[4] Michael H. Bowling et al. No-Regret Learning in Extensive-Form Games with Imperfect Recall. ICML, 2012.

[5] Peter Auer et al. Finite-time Analysis of the Multiarmed Bandit Problem. Machine Learning, 2002.

[6] Wojciech M. Czarnecki et al. Grandmaster level in StarCraft II using multi-agent reinforcement learning. Nature, 2019.

[7] Tuomas Sandholm et al. Lossless abstraction of imperfect information games. Journal of the ACM, 2007.

[8] Tuomas Sandholm et al. Discretization of Continuous Action Spaces in Extensive-Form Games. AAMAS, 2015.

[9] Jakub W. Pachocki et al. Dota 2 with Large Scale Deep Reinforcement Learning. arXiv, 2019.

[10] Michael P. Wellman et al. Methods for empirical game-theoretic analysis (extended abstract). 2006.

[11] Kevin Waugh et al. DeepStack: Expert-level artificial intelligence in heads-up no-limit poker. Science, 2017.

[12] Tuomas Sandholm et al. Model-Free Online Learning in Unknown Sequential Decision Making Problems and Games. AAAI, 2021.

[13] Javier Peña et al. Smoothing Techniques for Computing Nash Equilibria of Sequential Games. Mathematics of Operations Research, 2010.

[14] Jonathan Schaeffer et al. Approximating Game-Theoretic Optimal Strategies for Full-scale Poker. IJCAI, 2003.

[15] David Silver et al. A Unified Game-Theoretic Approach to Multiagent Reinforcement Learning. NIPS, 2017.

[16] S. Hart et al. A simple adaptive procedure leading to correlated equilibrium. 2000.

[17] Tuomas Sandholm et al. Small Nash Equilibrium Certificates in Very Large Games. NeurIPS, 2020.

[18] Nicola Basilico et al. Automated Abstractions for Patrolling Security Games. AAAI, 2011.

[19] Tuomas Sandholm et al. Solving Large Sequential Games with the Excessive Gap Technique. NeurIPS, 2018.

[20] Marcello Restelli et al. Equilibrium approximation in simulation-based extensive-form games. AAMAS, 2011.

[21] Neil Burch et al. Heads-up limit hold'em poker is solved. Science, 2015.

[22] Tuomas Sandholm et al. Imperfect-Recall Abstractions with Bounds in Games. EC, 2014.

[23] Nils J. Nilsson et al. A Formal Basis for the Heuristic Determination of Minimum Cost Paths. IEEE Transactions on Systems Science and Cybernetics, 1968.

[24] Tuomas Sandholm et al. Stochastic Regret Minimization in Extensive-Form Games. ICML, 2020.

[25] Tuomas Sandholm et al. Extensive-form game abstraction with bounds. EC, 2014.

[26] Tuomas Sandholm et al. Bandit Linear Optimization for Sequential Decision Making and Extensive-Form Games. AAAI, 2021.

[27] Bernhard von Stengel et al. Fast algorithms for finding randomized strategies in game trees. STOC, 1994.

[28] Tuomas Sandholm et al. Simultaneous Abstraction and Equilibrium Finding in Games. IJCAI, 2015.

[29] Michael H. Bowling et al. Bayes' Bluff: Opponent Modelling in Poker. UAI, 2005.

[30] Branislav Bosanský et al. An Exact Double-Oracle Algorithm for Zero-Sum Extensive-Form Games with Imperfect Information. Journal of Artificial Intelligence Research, 2014.

[31] Branislav Bosanský et al. An Algorithm for Constructing and Solving Imperfect Recall Abstractions of Large Extensive-Form Games. IJCAI, 2017.

[32] Tuomas Sandholm et al. Lossy stochastic game abstraction with bounds. EC, 2012.

[33] Noam Brown et al. Superhuman AI for multiplayer poker. Science, 2019.

[34] Tuomas Sandholm et al. Solving Imperfect-Information Games via Discounted Regret Minimization. AAAI, 2018.

[35] Amy Greenwald et al. Improved Algorithms for Learning Equilibria in Simulation-Based Games. AAMAS, 2020.

[36] Michael H. Bowling et al. Regret Minimization in Games with Incomplete Information. NIPS, 2007.

[37] Noam Brown et al. Superhuman AI for heads-up no-limit poker: Libratus beats top professionals. Science, 2018.

[38] Kevin Waugh et al. Monte Carlo Sampling for Regret Minimization in Extensive Games. NIPS, 2009.

[39] Joel Z. Leibo et al. A Generalised Method for Empirical Game Theoretic Analysis. AAMAS, 2018.