Model-Free Learning for Two-Player Zero-Sum Partially Observable Markov Games with Perfect Recall

We study the problem of learning a Nash equilibrium (NE) in an imperfect-information game (IIG) through self-play. Specifically, we focus on two-player, zero-sum, episodic, tabular IIGs under the perfect-recall assumption, where the only feedback consists of realizations of the game (bandit feedback). In particular, the dynamics of the IIG are not known; we can only access them by sampling or by interacting with a game simulator. For this learning setting, we provide the Implicit Exploration Online Mirror Descent (IXOMD) algorithm. It is a model-free algorithm with a high-probability bound on the convergence rate to the NE of order 1/√T, where T is the number of played games. Moreover, IXOMD is computationally efficient, as it only needs to perform updates along the sampled trajectory.
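The two ingredients named in the abstract can be illustrated in a simplified setting: the implicit-exploration (IX) loss estimator, which biases the importance weight by a small γ to obtain high-probability guarantees, and an online mirror descent (OMD) step with an entropy regularizer, which reduces to a multiplicative update on the single sampled coordinate. The sketch below shows these on a plain adversarial bandit, not the full tree-structured IIG; the function name and parameters are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

def ix_omd_bandit(loss_matrix, eta, gamma, rng=None):
    """Sketch of OMD with the IX loss estimator on a K-armed adversarial
    bandit (a hypothetical simplification of trajectory-level IXOMD).
    Returns the average loss incurred over the T rounds."""
    rng = rng if rng is not None else np.random.default_rng(0)
    T, K = loss_matrix.shape
    log_w = np.zeros(K)              # log-weights of the OMD iterate
    total_loss = 0.0
    for t in range(T):
        p = np.exp(log_w - log_w.max())
        p /= p.sum()                 # current policy: softmax of log-weights
        a = rng.choice(K, p=p)       # one action = one sampled "trajectory"
        loss = loss_matrix[t, a]
        total_loss += loss
        # IX estimator: the extra gamma in the denominator biases the
        # importance weight downward, trading bias for controlled variance.
        loss_hat = loss / (p[a] + gamma)
        # Entropy-regularized OMD step = multiplicative-weights update;
        # only the sampled coordinate changes (update along the trajectory).
        log_w[a] -= eta * loss_hat
    return total_loss / T
```

On a two-armed instance where one arm always incurs loss 0 and the other loss 1, the average loss quickly concentrates near 0, since the update touches only the played arm yet the IX estimate remains an (optimistically biased) proxy for the full loss vector.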
