Uncoupled and Convergent Learning in Two-Player Zero-Sum Markov Games

We revisit the problem of learning in two-player zero-sum Markov games, focusing on developing an algorithm that is uncoupled, convergent, and rational, with non-asymptotic convergence rates. As a warm-up, we start with the stateless matrix game setting under bandit feedback and show an $\mathcal{O}(t^{-\frac{1}{8}})$ last-iterate convergence rate. To the best of our knowledge, this is the first result establishing a finite last-iterate convergence rate with access to only bandit feedback. We then extend our result to irreducible Markov games, providing a last-iterate convergence rate of $\mathcal{O}(t^{-\frac{1}{9+\varepsilon}})$ for any $\varepsilon>0$. Finally, we study Markov games without any assumptions on the dynamics, and show a path convergence rate, a new notion of convergence that we define, of $\mathcal{O}(t^{-\frac{1}{10}})$. Our algorithm removes the synchronization and prior-knowledge requirements of [Wei et al., 2021], which pursued the same goals for irreducible Markov games. Our algorithm is related to [Chen et al., 2021, Cen et al., 2021] and likewise builds on the entropy regularization technique. However, we remove their requirement of communicating entropy values, making our algorithm entirely uncoupled.
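To make the stateless warm-up setting concrete, below is a minimal, self-contained Python sketch of an uncoupled, entropy-regularized bandit learner in a zero-sum matrix game. It is an illustration in the spirit of the approach described above, not the paper's actual algorithm: the game matrix, the exploration parameter gamma, and the schedules for eta and tau are illustrative assumptions and carry no claim to the convergence rates stated in the abstract.

```python
# Minimal sketch (assumptions, not the paper's method): two *uncoupled* learners in a
# zero-sum matrix game with bandit feedback. Each player sees only its own sampled
# payoff, forms an importance-weighted payoff estimate, and takes an
# entropy-regularized multiplicative-weights step.

import numpy as np

rng = np.random.default_rng(0)

# Illustrative game (matching pennies): the row player maximizes x^T A y, the column
# player minimizes it; the unique Nash equilibrium plays each action with probability 1/2.
A = np.array([[1.0, -1.0],
              [-1.0, 1.0]])
n, m = A.shape

x = np.ones(n) / n   # row player's mixed strategy
y = np.ones(m) / m   # column player's mixed strategy
gamma = 0.05         # assumed uniform-exploration mixing, keeps payoff estimates bounded

T = 200_000
for t in range(1, T + 1):
    eta = t ** -0.75   # assumed step-size schedule
    tau = t ** -0.25   # assumed (decaying) entropy-regularization weight

    # Sampling distributions mix in a little uniform exploration.
    px = (1 - gamma) * x + gamma / n
    py = (1 - gamma) * y + gamma / m

    # Each player samples one action and observes only the realized payoff (bandit feedback).
    i = rng.choice(n, p=px)
    j = rng.choice(m, p=py)
    r = A[i, j]

    # Importance-weighted (unbiased) estimates of each player's payoff vector.
    g_x = np.zeros(n); g_x[i] = r / px[i]     # row player maximizes
    g_y = np.zeros(m); g_y[j] = -r / py[j]    # column player minimizes, so it sees -r

    # Entropy-regularized multiplicative-weights step: -tau * (log x + 1) is the gradient
    # of tau * H(x) and pulls the iterate toward uniform, which is what gives this family
    # of methods last-iterate (rather than only average-iterate) behavior.
    x = x * np.exp(eta * (g_x - tau * (np.log(x) + 1.0)))
    x /= x.sum()
    y = y * np.exp(eta * (g_y - tau * (np.log(y) + 1.0)))
    y /= y.sum()

print("final strategies:", np.round(x, 3), np.round(y, 3))  # should drift toward (0.5, 0.5)
```

The point of the sketch is the uncoupling: each player updates using only its own sampled payoff, so the two update rules never exchange strategies, gradients, or entropy values.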

[1] Jason D. Lee et al. Can We Find Nash Equilibria at a Linear Rate in Markov Games?, 2023, ICLR.

[2] Weiqiang Zheng et al. Doubly Optimal No-Regret Learning in Monotone Games, 2023, ICML.

[3] Jianghai Hu et al. Zeroth-Order Learning in Continuous Games via Residual Pseudogradient Estimates, 2023, arXiv:2301.02279.

[4] T. Zhang et al. A Self-Play Posterior Sampling Algorithm for Zero-Sum Markov Games, 2022, ICML.

[5] S. Du et al. Faster Last-iterate Convergence of Policy Optimization in Zero-Sum Markov Games, 2022, ICLR.

[6] Cong Ma et al. $O(T^{-1})$ Convergence of Optimistic-Follow-the-Regularized-Leader in Two-Player Zero-Sum Markov Games, 2022, arXiv.

[7] Caiming Xiong et al. Policy Optimization for Markov Games: Unified Framework and Faster Convergence, 2022, NeurIPS.

[8] Eduard A. Gorbunov et al. Last-Iterate Convergence of Optimistic Gradient Method for Monotone Variational Inequalities, 2022, NeurIPS.

[9] M. Kamgarpour et al. On the Rate of Convergence of Payoff-based Algorithms to Nash Equilibrium in Strongly Monotone Games, 2022, arXiv.

[10] Tianyi Lin et al. Doubly Optimal No-Regret Online Learning in Strongly Monotone Games with Bandit Feedback, 2021, arXiv:2112.02856.

[11] Chi Jin et al. V-Learning - A Simple, Efficient, Decentralized Algorithm for Multiagent RL, 2021, arXiv.

[12] Ashutosh Nayyar et al. Learning Zero-sum Stochastic Games with Posterior Sampling, 2021, arXiv.

[13] Zhuoran Yang et al. Towards General Function Approximation in Zero-Sum Markov Games, 2021, ICLR.

[14] Tiancheng Yu et al. The Power of Exploiter: Provable Multi-Agent RL in Large State Spaces, 2021, ICML.

[15] Tamer Basar et al. Decentralized Q-Learning in Zero-sum Markov Games, 2021, NeurIPS.

[16] Yuejie Chi et al. Fast Policy Extragradient Methods for Competitive Games with Entropy Regularization, 2021, NeurIPS.

[17] Jason D. Lee et al. Provably Efficient Policy Optimization for Two-Player Zero-Sum Markov Games, 2021, AISTATS.

[18] Haipeng Luo et al. Last-iterate Convergence of Decentralized Optimistic Gradient Descent/Ascent in Infinite-horizon Competitive Markov Games, 2021, COLT.

[19] Noah Golowich et al. Independent Policy Gradient Methods for Competitive Reinforcement Learning, 2021, NeurIPS.

[20] Anant Sahai et al. On the Impossibility of Convergence of Mixed Strategies with No Regret Learning, 2020, arXiv.

[21] Noah Golowich et al. Tight last-iterate convergence rates for no-regret learning in multi-player games, 2020, NeurIPS.

[22] A. Ozdaglar et al. Fictitious play in zero-sum stochastic games, 2020, SIAM J. Control. Optim.

[23] Qinghua Liu et al. A Sharp Analysis of Model-based Reinforcement Learning with Self-Play, 2020, ICML.

[24] Chi Jin et al. Near-Optimal Reinforcement Learning with Self-Play, 2020, NeurIPS.

[25] Haipeng Luo et al. Linear Last-iterate Convergence in Constrained Saddle-point Optimization, 2020, ICLR.

[26] Zhuoran Yang et al. Learning Zero-Sum Simultaneous-Move Markov Games Using Function Approximation and Correlated Equilibrium, 2020, COLT.

[27] Chi Jin et al. Provable Self-Play Algorithms for Competitive Reinforcement Learning, 2020, ICML.

[28] J. Malick et al. On the convergence of single-call stochastic extra-gradient methods, 2019, NeurIPS.

[29] Xiaoyu Chen et al. Q-learning with UCB Exploration is Sample Efficient for Infinite-Horizon MDP, 2019, ICLR.

[30] David S. Leslie et al. Bandit learning in concave $N$-person games, 2018, arXiv:1810.01925.

[31] Georgios Piliouras et al. Multiplicative Weights Update in Zero-Sum Games, 2018, EC.

[32] Tengyuan Liang et al. Interaction Matters: A Note on Non-asymptotic Local Convergence of Generative Adversarial Networks, 2018, AISTATS.

[33] Chi-Jen Lu et al. Online Reinforcement Learning in Stochastic Games, 2017, NIPS.

[34] Demis Hassabis et al. Mastering the game of Go without human knowledge, 2017, Nature.

[35] Christos H. Papadimitriou et al. Cycles in adversarial regularized learning, 2017, SODA.

[36] Serdar Yüksel et al. Decentralized Q-Learning for Stochastic Teams and Games, 2015, IEEE Transactions on Automatic Control.

[37] Gergely Neu et al. Explore no more: Improved high-probability regret bounds for non-stochastic bandits, 2015, NIPS.

[38] Tor Lattimore et al. Near-optimal PAC bounds for discounted MDPs, 2014, Theor. Comput. Sci.

[39] Constantinos Daskalakis et al. Near-optimal no-regret algorithms for zero-sum games, 2011, SODA '11.

[40] Michael P. Wellman et al. Nash Q-Learning for General-Sum Stochastic Games, 2003, J. Mach. Learn. Res.

[41] Vincent Conitzer et al. AWESOME: A general multiagent learning algorithm that converges in self-play and learns a best response against stationary opponents, 2003, Machine Learning.

[42] Ronen I. Brafman et al. R-MAX - A General Polynomial Time Algorithm for Near-Optimal Reinforcement Learning, 2001, J. Mach. Learn. Res.

[43] Manuela M. Veloso et al. Rational and Convergent Learning in Stochastic Games, 2001, IJCAI.

[44] Csaba Szepesvári et al. A Unified Analysis of Value-Function-Based Reinforcement-Learning Algorithms, 1999, Neural Computation.

[45] P. Tseng. On linear convergence of iterative methods for the variational inequality problem, 1995.

[46] Michael L. Littman. Markov Games as a Framework for Multi-Agent Reinforcement Learning, 1994, ICML.

[47] Martin L. Puterman. Markov Decision Processes: Discrete Stochastic Dynamic Programming, 1994.

[48] J. Wal. Discounted Markov games: Generalized policy iteration method, 1978.

[49] M. Pollatschek et al. Algorithms for Stochastic Games with Geometrical Interpretation, 1969.

[50] L. Shapley. Stochastic Games, 1953, Proceedings of the National Academy of Sciences.

[51] J. Neumann. Zur Theorie der Gesellschaftsspiele, 1928.

[52] Shaocong Ma et al. Sample Efficient Stochastic Policy Extragradient Algorithm for Zero-Sum Markov Game, 2022, ICLR.

[53] Yang Cai et al. Finite-Time Last-Iterate Convergence for Learning in Multi-Player Games, 2022, NeurIPS.

[54] V. Cevher et al. A Natural Actor-Critic Framework for Zero-Sum Markov Games, 2022, ICML.

[55] P. Mertikopoulos et al. On the Rate of Convergence of Regularized Learning in Games: From Bandits and Uncertainty to Optimism and Beyond, 2021, NeurIPS.

[56] Quanquan Gu et al. Almost Optimal Algorithms for Two-player Zero-Sum Markov Games with Linear Function Approximation, 2021.

[57] Shai Shalev-Shwartz and Shai Ben-David. Understanding Machine Learning: From Theory to Algorithms, 2014, Cambridge University Press.

[58] J. Filar et al. On the Algorithm of Pollatschek and Avi-Itzhak, 1991.

[59] R. Karp et al. On Nonterminating Stochastic Games, 1966.