Uncoupled and Convergent Learning in Two-Player Zero-Sum Markov Games

We revisit the problem of learning in two-player zero-sum Markov games, focusing on developing an algorithm that is uncoupled, convergent, and rational, with non-asymptotic convergence rates. As a warm-up, we start with the stateless matrix game setting under bandit feedback and show an $\mathcal{O}(t^{-\frac{1}{8}})$ last-iterate convergence rate. To the best of our knowledge, this is the first result establishing a finite last-iterate convergence rate with access to only bandit feedback. We then extend our result to irreducible Markov games, providing a last-iterate convergence rate of $\mathcal{O}(t^{-\frac{1}{9+\varepsilon}})$ for any $\varepsilon>0$. Finally, we study Markov games without any assumptions on the dynamics, and show a path convergence rate, a new notion of convergence that we define, of $\mathcal{O}(t^{-\frac{1}{10}})$. Our algorithm removes the synchronization and prior-knowledge requirements of [Wei et al., 2021], which pursued the same goals for irreducible Markov games. Our algorithm is related to [Chen et al., 2021, Cen et al., 2021] and likewise builds on the entropy regularization technique. However, we remove their requirement of communicating entropy values, making our algorithm entirely uncoupled.
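To make the stateless warm-up setting concrete, below is a minimal, self-contained Python sketch of an uncoupled, entropy-regularized bandit learner in a zero-sum matrix game. It is an illustration in the spirit of the approach described above, not the paper's actual algorithm: the game matrix, the exploration parameter gamma, and the schedules for eta and tau are illustrative assumptions and carry no claim to the convergence rates stated in the abstract.

```python
# Minimal sketch (assumptions, not the paper's method): two *uncoupled* learners in a
# zero-sum matrix game with bandit feedback. Each player sees only its own sampled
# payoff, forms an importance-weighted payoff estimate, and takes an
# entropy-regularized multiplicative-weights step.

import numpy as np

rng = np.random.default_rng(0)

# Illustrative game (matching pennies): the row player maximizes x^T A y, the column
# player minimizes it; the unique Nash equilibrium plays each action with probability 1/2.
A = np.array([[1.0, -1.0],
              [-1.0, 1.0]])
n, m = A.shape

x = np.ones(n) / n   # row player's mixed strategy
y = np.ones(m) / m   # column player's mixed strategy
gamma = 0.05         # assumed uniform-exploration mixing, keeps payoff estimates bounded

T = 200_000
for t in range(1, T + 1):
    eta = t ** -0.75   # assumed step-size schedule
    tau = t ** -0.25   # assumed (decaying) entropy-regularization weight

    # Sampling distributions mix in a little uniform exploration.
    px = (1 - gamma) * x + gamma / n
    py = (1 - gamma) * y + gamma / m

    # Each player samples one action and observes only the realized payoff (bandit feedback).
    i = rng.choice(n, p=px)
    j = rng.choice(m, p=py)
    r = A[i, j]

    # Importance-weighted (unbiased) estimates of each player's payoff vector.
    g_x = np.zeros(n); g_x[i] = r / px[i]     # row player maximizes
    g_y = np.zeros(m); g_y[j] = -r / py[j]    # column player minimizes, so it sees -r

    # Entropy-regularized multiplicative-weights step: -tau * (log x + 1) is the gradient
    # of tau * H(x) and pulls the iterate toward uniform, which is what gives this family
    # of methods last-iterate (rather than only average-iterate) behavior.
    x = x * np.exp(eta * (g_x - tau * (np.log(x) + 1.0)))
    x /= x.sum()
    y = y * np.exp(eta * (g_y - tau * (np.log(y) + 1.0)))
    y /= y.sum()

print("final strategies:", np.round(x, 3), np.round(y, 3))  # should drift toward (0.5, 0.5)
```

The point of the sketch is the uncoupling: each player updates using only its own sampled payoff, so the two update rules never exchange strategies, gradients, or entropy values.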

[1] Jason D. Lee et al. Can We Find Nash Equilibria at a Linear Rate in Markov Games?, 2023, ICLR.

[2] Weiqiang Zheng et al. Doubly Optimal No-Regret Learning in Monotone Games, 2023, ICML.

[3] Jianghai Hu et al. Zeroth-Order Learning in Continuous Games via Residual Pseudogradient Estimates, 2023, arXiv:2301.02279.

[4] T. Zhang et al. A Self-Play Posterior Sampling Algorithm for Zero-Sum Markov Games, 2022, ICML.

[5] S. Du et al. Faster Last-iterate Convergence of Policy Optimization in Zero-Sum Markov Games, 2022, ICLR.

[6] Cong Ma et al. $O(T^{-1})$ Convergence of Optimistic-Follow-the-Regularized-Leader in Two-Player Zero-Sum Markov Games, 2022, arXiv.

[7] Caiming Xiong et al. Policy Optimization for Markov Games: Unified Framework and Faster Convergence, 2022, NeurIPS.

[8] Eduard A. Gorbunov et al. Last-Iterate Convergence of Optimistic Gradient Method for Monotone Variational Inequalities, 2022, NeurIPS.

[9] M. Kamgarpour et al. On the Rate of Convergence of Payoff-based Algorithms to Nash Equilibrium in Strongly Monotone Games, 2022, arXiv.

[10] Tianyi Lin et al. Doubly Optimal No-Regret Online Learning in Strongly Monotone Games with Bandit Feedback, 2021, arXiv:2112.02856.

[11] Chi Jin et al. V-Learning - A Simple, Efficient, Decentralized Algorithm for Multiagent RL, 2021, arXiv.

[12] Ashutosh Nayyar et al. Learning Zero-sum Stochastic Games with Posterior Sampling, 2021, arXiv.

[13] Zhuoran Yang et al. Towards General Function Approximation in Zero-Sum Markov Games, 2021, ICLR.

[14] Tiancheng Yu et al. The Power of Exploiter: Provable Multi-Agent RL in Large State Spaces, 2021, ICML.

[15] Tamer Basar et al. Decentralized Q-Learning in Zero-sum Markov Games, 2021, NeurIPS.

[16] Yuejie Chi et al. Fast Policy Extragradient Methods for Competitive Games with Entropy Regularization, 2021, NeurIPS.

[17] Jason D. Lee et al. Provably Efficient Policy Optimization for Two-Player Zero-Sum Markov Games, 2021, AISTATS.

[18] Haipeng Luo et al. Last-iterate Convergence of Decentralized Optimistic Gradient Descent/Ascent in Infinite-horizon Competitive Markov Games, 2021, COLT.

[19] Noah Golowich et al. Independent Policy Gradient Methods for Competitive Reinforcement Learning, 2021, NeurIPS.

[20] Anant Sahai et al. On the Impossibility of Convergence of Mixed Strategies with No Regret Learning, 2020, arXiv.

[21] Noah Golowich et al. Tight last-iterate convergence rates for no-regret learning in multi-player games, 2020, NeurIPS.

[22] A. Ozdaglar et al. Fictitious play in zero-sum stochastic games, 2020, SIAM J. Control. Optim.

[23] Qinghua Liu et al. A Sharp Analysis of Model-based Reinforcement Learning with Self-Play, 2020, ICML.

[24] Chi Jin et al. Near-Optimal Reinforcement Learning with Self-Play, 2020, NeurIPS.

[25] Haipeng Luo et al. Linear Last-iterate Convergence in Constrained Saddle-point Optimization, 2020, ICLR.

[26] Zhuoran Yang et al. Learning Zero-Sum Simultaneous-Move Markov Games Using Function Approximation and Correlated Equilibrium, 2020, COLT.

[27] Chi Jin et al. Provable Self-Play Algorithms for Competitive Reinforcement Learning, 2020, ICML.

[28] J. Malick et al. On the convergence of single-call stochastic extra-gradient methods, 2019, NeurIPS.

[29] Xiaoyu Chen et al. Q-learning with UCB Exploration is Sample Efficient for Infinite-Horizon MDP, 2019, ICLR.

[30] David S. Leslie et al. Bandit learning in concave $N$-person games, 2018, arXiv:1810.01925.

[31] Georgios Piliouras et al. Multiplicative Weights Update in Zero-Sum Games, 2018, EC.

[32] Tengyuan Liang et al. Interaction Matters: A Note on Non-asymptotic Local Convergence of Generative Adversarial Networks, 2018, AISTATS.

[33] Chi-Jen Lu et al. Online Reinforcement Learning in Stochastic Games, 2017, NIPS.

[34] Demis Hassabis et al. Mastering the game of Go without human knowledge, 2017, Nature.

[35] Christos H. Papadimitriou et al. Cycles in adversarial regularized learning, 2017, SODA.

[36] Serdar Yüksel et al. Decentralized Q-Learning for Stochastic Teams and Games, 2015, IEEE Transactions on Automatic Control.

[37] Gergely Neu et al. Explore no more: Improved high-probability regret bounds for non-stochastic bandits, 2015, NIPS.

[38] Tor Lattimore et al. Near-optimal PAC bounds for discounted MDPs, 2014, Theor. Comput. Sci.

[39] Constantinos Daskalakis et al. Near-optimal no-regret algorithms for zero-sum games, 2011, SODA '11.

[40] Michael P. Wellman et al. Nash Q-Learning for General-Sum Stochastic Games, 2003, J. Mach. Learn. Res.

[41] Vincent Conitzer et al. AWESOME: A general multiagent learning algorithm that converges in self-play and learns a best response against stationary opponents, 2003, Machine Learning.

[42] Ronen I. Brafman et al. R-MAX - A General Polynomial Time Algorithm for Near-Optimal Reinforcement Learning, 2001, J. Mach. Learn. Res.

[43] Manuela M. Veloso et al. Rational and Convergent Learning in Stochastic Games, 2001, IJCAI.

[44] Csaba Szepesvári et al. A Unified Analysis of Value-Function-Based Reinforcement-Learning Algorithms, 1999, Neural Computation.

[45] P. Tseng. On linear convergence of iterative methods for the variational inequality problem, 1995.

[46] Michael L. Littman. Markov Games as a Framework for Multi-Agent Reinforcement Learning, 1994, ICML.

[47] Martin L. Puterman. Markov Decision Processes: Discrete Stochastic Dynamic Programming, 1994.

[48] J. Wal. Discounted Markov games: Generalized policy iteration method, 1978.

[49] M. Pollatschek et al. Algorithms for Stochastic Games with Geometrical Interpretation, 1969.

[50] L. Shapley. Stochastic Games, 1953, Proceedings of the National Academy of Sciences.

[51] J. Neumann. Zur Theorie der Gesellschaftsspiele, 1928.

[52] Shaocong Ma et al. Sample Efficient Stochastic Policy Extragradient Algorithm for Zero-Sum Markov Game, 2022, ICLR.

[53] Yang Cai et al. Finite-Time Last-Iterate Convergence for Learning in Multi-Player Games, 2022, NeurIPS.

[54] V. Cevher et al. A Natural Actor-Critic Framework for Zero-Sum Markov Games, 2022, ICML.

[55] P. Mertikopoulos et al. On the Rate of Convergence of Regularized Learning in Games: From Bandits and Uncertainty to Optimism and Beyond, 2021, NeurIPS.

[56] Quanquan Gu et al. Almost Optimal Algorithms for Two-player Zero-Sum Markov Games with Linear Function Approximation, 2021.

[57] Shai Shalev-Shwartz and Shai Ben-David. Understanding Machine Learning: From Theory to Algorithms, 2014, Cambridge University Press.

[58] J. Filar et al. On the Algorithm of Pollatschek and Avi-Itzhak, 1991.

[59] R. Karp et al. On Nonterminating Stochastic Games, 1966.