Decentralized Q-Learning in Zero-sum Markov Games

We study multi-agent reinforcement learning (MARL) in infinite-horizon discounted zero-sum Markov games. We focus on the practical but challenging setting of decentralized MARL, where agents make decisions without coordination by a centralized controller, based only on their own payoffs and the local actions they execute. The agents need not observe the opponent's actions or payoffs, may even be oblivious to the opponent's presence, and need not be aware of the zero-sum structure of the underlying game, a setting also referred to as radically uncoupled in the literature on learning in games. In this paper, we develop a radically uncoupled Q-learning dynamics that is both rational and convergent: the learning dynamics converges to the best response to the opponent's strategy when the opponent follows an asymptotically stationary strategy; when both agents adopt the learning dynamics, they converge to the Nash equilibrium of the game. The key challenge in this decentralized setting is the non-stationarity of the environment from an agent's perspective, since both her own payoffs and the system evolution depend on the actions of other agents, and each agent adapts her policy simultaneously and independently. To address this issue, we develop a two-timescale learning dynamics in which each agent updates her local Q-function and value function estimates concurrently, with the value function updated on a slower timescale.
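To make the two-timescale structure concrete, here is a minimal sketch in Python. The toy random zero-sum Markov game, the softmax temperature `tau`, and the step-size schedules `alpha` and `beta` are illustrative assumptions, not the paper's exact specification; the essential structure is that each agent observes only her own reward, own action, and the state, updates a local Q-table on the fast timescale, and tracks a smoothed best-response value on the slow timescale.

```python
# A minimal sketch of radically uncoupled, two-timescale Q-learning in a zero-sum
# Markov game. Environment, temperature, and step sizes are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)

# --- A small random zero-sum Markov game (assumed for illustration) ---
n_states, n_actions, gamma = 4, 3, 0.9
# payoff[s, a1, a2] is agent 1's reward; agent 2 receives its negation.
payoff = rng.uniform(-1.0, 1.0, size=(n_states, n_actions, n_actions))
# transition[s, a1, a2] is a probability distribution over next states.
transition = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions, n_actions))

tau = 0.1       # softmax (smoothed best-response) temperature -- assumption
T = 200_000     # number of learning steps

def softmax(q_row):
    z = (q_row - q_row.max()) / tau
    e = np.exp(z)
    return e / e.sum()

def smoothed_value(q_row):
    # Entropy-regularized value of the local Q-vector: tau * logsumexp(q / tau).
    return tau * np.log(np.exp((q_row - q_row.max()) / tau).sum()) + q_row.max()

# Each agent keeps only a local Q-table over (state, own action) and a value table.
q = [np.zeros((n_states, n_actions)) for _ in range(2)]
v = [np.zeros(n_states) for _ in range(2)]
visits = np.zeros(n_states, dtype=int)  # per-state visit counts for step sizes

s = 0
for t in range(T):
    visits[s] += 1
    n = visits[s]
    alpha = 1.0 / n                          # faster timescale: local Q-update
    beta = 1.0 / (1.0 + n * np.log(n + 1))   # slower timescale: value update (assumed schedule)

    # Each agent samples her own action from a smoothed best response to her local Q.
    a = [rng.choice(n_actions, p=softmax(q[i][s])) for i in range(2)]

    r1 = payoff[s, a[0], a[1]]
    rewards = (r1, -r1)  # zero-sum structure, unknown to the agents themselves
    s_next = rng.choice(n_states, p=transition[s, a[0], a[1]])

    for i in range(2):
        # Fast Q-update uses only the agent's own reward, own action, and next state.
        target = rewards[i] + gamma * v[i][s_next]
        q[i][s, a[i]] += alpha * (target - q[i][s, a[i]])
        # Slow value update tracks the smoothed best-response value of the local Q.
        v[i][s] += beta * (smoothed_value(q[i][s]) - v[i][s])

    s = s_next

print(v[0])
print(-v[1])
```

Under these assumptions, when both agents run the dynamics, `v[0]` and `-v[1]` should roughly agree near the minimax value of the toy game, up to the bias introduced by the softmax smoothing; when one agent's strategy is (asymptotically) stationary, the other agent's fast Q-update reduces to ordinary single-agent Q-learning against that strategy, which is the rationality property described above.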
