Learning Probably Approximately Correct Maximin Strategies in Simulation-Based Games with Infinite Strategy Spaces

We tackle the problem of learning equilibria in simulation-based games. In such games, the players' utility functions cannot be described analytically: they are given through a black-box simulator that can be queried to obtain noisy estimates of the utilities. This is the case in many real-world games in which a complete description of the elements involved is not available upfront, such as complex military settings and online auctions. In these situations, one usually needs to run costly simulation processes to get an accurate estimate of the game outcome. As a result, solving these games poses the challenge of designing learning algorithms that find (approximate) equilibria with high confidence while using as few simulator queries as possible. Moreover, since running the simulator during actual play is infeasible, the algorithms must first perform a pure-exploration learning phase and then use the (approximate) equilibrium learned in that phase to play the game. In this work, we focus on two-player zero-sum games with infinite strategy spaces. Drawing from the best arm identification literature, we design two algorithms with theoretical guarantees for learning maximin strategies in these games. The first works in the fixed-confidence setting, guaranteeing the desired confidence level while minimizing the number of queries. The second works in the fixed-budget setting, maximizing the confidence without exceeding the given maximum number of queries. We formally prove δ-PAC guarantees for both algorithms under regularity assumptions, encoded by assuming that the utility functions are drawn from a Gaussian process. We then experimentally evaluate our techniques on a testbed of randomly generated games and on instances representing simple real-world security settings.
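
To make the setting concrete, the following is a minimal Python sketch of the surrogate-based idea, not the paper's algorithms: noisy utilities from a black-box simulator are modeled with a Gaussian process, and a maximin strategy argmax_x min_y u(x, y) is estimated from the posterior mean over a discretized strategy grid. The toy simulator `noisy_utility`, the uniform exploration rule, the grid resolution, and the use of scikit-learn's `GaussianProcessRegressor` are all illustrative assumptions, not elements of the paper.

```python
# Illustrative sketch only: GP-based estimation of a maximin strategy in a
# two-player zero-sum game with a noisy black-box simulator. This is NOT the
# paper's algorithm; the simulator, grid, and query rule are toy assumptions.
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

rng = np.random.default_rng(0)

def noisy_utility(x, y, noise=0.1):
    """Hypothetical black-box simulator: player 1's payoff at profile (x, y)."""
    return np.sin(3 * x) * np.cos(2 * y) - 0.5 * (x - y) ** 2 + noise * rng.standard_normal()

# Discretize both (infinite) strategy spaces on [0, 1] for illustration.
grid = np.linspace(0.0, 1.0, 25)
X1, X2 = np.meshgrid(grid, grid, indexing="ij")
profiles = np.column_stack([X1.ravel(), X2.ravel()])

# Pure-exploration phase: query the simulator at uniformly sampled profiles.
budget = 200
queries = rng.uniform(0.0, 1.0, size=(budget, 2))
payoffs = np.array([noisy_utility(x, y) for x, y in queries])

# Fit a GP surrogate of the utility function (the regularity assumption).
gp = GaussianProcessRegressor(kernel=RBF(length_scale=0.2) + WhiteKernel(0.01),
                              normalize_y=True)
gp.fit(queries, payoffs)

# Estimate the maximin strategy from the posterior mean on the grid:
#   x_hat in argmax_x min_y mu(x, y).
mu = gp.predict(profiles).reshape(len(grid), len(grid))
worst_case = mu.min(axis=1)              # min over the opponent's strategies
x_hat = grid[int(worst_case.argmax())]   # empirical maximin strategy
print(f"estimated maximin strategy x = {x_hat:.3f}, value = {worst_case.max():.3f}")
```

The paper's fixed-confidence and fixed-budget algorithms choose their queries adaptively and come with formal δ-PAC guarantees; the sketch above only illustrates how a GP surrogate turns noisy simulator data into a maximin estimate.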
