Global Convergence of Multi-Agent Policy Gradient in Markov Potential Games

Potential games are arguably one of the most important and widely studied classes of normal-form games. They define the archetypal setting of multi-agent coordination, as all agent utilities are perfectly aligned with each other via a common potential function. Can this intuitive framework be transplanted to the setting of Markov games? What are the similarities and differences between multi-agent coordination with and without state dependence? We present a novel definition of Markov Potential Games (MPGs) that generalizes prior attempts at capturing complex stateful multi-agent coordination. Counter-intuitively, insights from normal-form potential games do not carry over, as MPGs can consist of settings where state-games can be zero-sum games. In the opposite direction, Markov games where every state-game is a potential game are not necessarily MPGs. Nevertheless, MPGs showcase standard desirable properties such as the existence of deterministic Nash policies. In our main technical result, we prove convergence of independent policy gradient to Nash policies, polynomially fast in the approximation error, by adapting recent gradient dominance property arguments developed for single-agent MDPs to multi-agent learning settings.

1. Extended Abstract

Reinforcement learning (RL) has been a fundamental driver of numerous recent advances in Artificial Intelligence (AI) applications that range from super-human performance in competitive game-playing (Silver et al., 2016; 2018; Brown and Sandholm, 2019) and strategic decision-making in multiple tasks (Mnih et al., 2015; OpenAI, 2018; Vinyals et al., 2019) to robotics, autonomous driving and cyber-physical systems (Busoniu et al., 2008; Zhang et al., 2019). A core ingredient for the success of single-agent RL systems, which are typically modelled as Markov Decision Processes (MDPs), is the guaranteed existence of stationary deterministic optimal policies (Bertsekas, 2000; Sutton and Barto, 2018).
This allows for the design of efficient algorithms that provably converge towards the optimal policy (Agarwal et al., 2020). However, a majority of the above systems involve multi-agent interactions and, despite the notable empirical advancements, the theoretical convergence guarantees of existing multi-agent reinforcement learning (MARL) algorithms are not well understood.

The main challenge when transitioning from single-agent to multi-agent RL settings is the computation of Nash policies. A Nash policy for $n > 1$ agents is defined to be a profile of policies $(\pi_1^*, \dots, \pi_n^*)$ such that, with the stationary policies of all agents but $i$ fixed, $\pi_i^*$ is an optimal policy for the resulting single-agent MDP, and this is true for all $1 \le i \le n$ (see footnote 1 and Definition 1). Note that in multi-agent settings, Nash policies may not be unique in principle.

A common approach for computing Nash policies in MDPs is the use of policy gradient methods. There has been significant progress in the analysis of policy gradient methods during the last couple of years, notably including the work of Agarwal et al. (2020) and references therein, but it has mainly concerned the single-agent case: the convergence properties of policy gradient in MARL remain poorly understood. Existing steps towards a theory for multi-agent settings include Daskalakis et al. (2020), who show convergence of independent policy gradient to the optimal policy for two-agent zero-sum stochastic games; Wei et al. (2021), who improve the result of Daskalakis et al. (2020) using optimistic policy gradient; and Zhao et al. (2021), who study extensions of Natural Policy Gradient using function approximation.
It is worth noting that the positive results of (Daskalakis et al., 2020; Wei et al., 2021) and (Zhao et al., 2021) depend on the fact that two-agent stochastic zero-sum games satisfy the "min-max equals max-min" property (Shapley, 1953), even though the value-function landscape may not be convex-concave, which implies that Von Neumann's celebrated minimax theorem may not be applicable.

Footnote 1: Analogue of the Nash equilibrium notion.

Model and Informal Statement of Results. While the previous works enhance our understanding of competitive interactions, i.e., interactions in which gains can only come at the expense of others, MARL in cooperative settings remains largely under-explored and constitutes one of the current frontiers in AI research (Dafoe et al., 2020; Dafoe et al., 2021). Based on the above, our work is motivated by the following natural question:

Can we get (provably) fast convergence guarantees for multi-agent RL settings in which cooperation is desirable?

To address this question, we define and study a class of $n$-agent MDPs that naturally generalize normal-form potential games (Monderer and Shapley, 1996), called Markov Potential Games (MPGs). In words, a multi-agent MDP is an MPG as long as there exists a (state-dependent) real-valued potential function $\Phi$ such that, if an agent $i$ changes their policy (and the rest of the agents keep their policies unchanged), the difference in agent $i$'s value/utility, $V$, is captured by the difference in the value of $\Phi$ (see Definition 2). Weighted and ordinal MPGs are defined similarly to their normal-form counterparts (see Remark 1). Under our definition, we answer the above motivating question in the affirmative.
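The two verbal definitions used so far (Nash policy and MPG) can be sketched in symbols. The notation is an assumption made here for illustration: $V_s^i(\pi_i, \pi_{-i})$ denotes agent $i$'s value at state $s$ when agent $i$ plays $\pi_i$ and the others play $\pi_{-i}$, and $\Phi_s$ denotes the potential at state $s$. Definitions 1 and 2 remain the authoritative statements.

```latex
\begin{align*}
% Nash policy: no agent can gain by a unilateral deviation.
V_s^i(\pi_i^*, \pi_{-i}^*) &\ge V_s^i(\pi_i, \pi_{-i}^*)
  && \text{for all agents } i,\ \text{policies } \pi_i,\ \text{states } s, \\
% epsilon-approximate Nash policy: unilateral deviations gain at most epsilon.
V_s^i(\pi_i^*, \pi_{-i}^*) &\ge V_s^i(\pi_i, \pi_{-i}^*) - \epsilon
  && \text{for all agents } i,\ \text{policies } \pi_i,\ \text{states } s, \\
% MPG: a unilateral change in value is matched by the change in potential.
V_s^i(\pi_i, \pi_{-i}) - V_s^i(\pi_i', \pi_{-i})
  &= \Phi_s(\pi_i, \pi_{-i}) - \Phi_s(\pi_i', \pi_{-i})
  && \text{for all agents } i,\ \text{policies } \pi_i, \pi_i', \pi_{-i},\ \text{states } s.
\end{align*}
```

With a single state and no discounting, the last condition reduces to the familiar normal-form potential game condition.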
In particular, we show that if every agent $i$ independently runs (with simultaneous updates) policy gradient on their utility/value $V$, then after $O(\mathrm{poly}(1/\epsilon))$ iterations the system will reach an $\epsilon$-approximate Nash policy (see informal Theorem 1.1 and formal Theorem 4.5). Moreover, we show the finite-sample analogue: if every agent $i$ independently runs (with simultaneous updates) stochastic policy gradient, then with high probability the system will reach an $\epsilon$-approximate Nash policy after $O(\mathrm{poly}(1/\epsilon))$ iterations. Along the way, we prove several properties about the structure of MPGs and their Nash policies (see Theorem 1.2 and Section 3). Our results can be summarized in the following two theorems.

Theorem 1.1 (Convergence of Policy Gradient (Informal)). Consider an MPG with $n$ agents and let $\epsilon > 0$. Suppose that each agent $i$ runs independent policy gradient using direct parameterization on their policy and that the updates are simultaneous. Then the learning dynamics reach an $\epsilon$-Nash policy after $O(\mathrm{poly}(1/\epsilon))$ iterations. Moreover, suppose that each agent $i$ runs stochastic policy gradient using greedy parameterization (see (4)) on their policy and that the updates are simultaneous. Then the learning dynamics reach an $\epsilon$-Nash policy after $O(\mathrm{poly}(1/\epsilon))$ iterations.

This result holds trivially for weighted MPGs and asymptotically also for ordinal MPGs, see Remark 4.

Theorem 1.2 (Structural Properties of MPGs). The following facts are true for MPGs with $n$ agents:
a. There always exists a Nash policy profile $(\pi_1^*, \dots, \pi_n^*)$

Figure 1. An MDP which is potential at every state but which is not an MPG due to conflicting preferences over states. The agents' instantaneous rewards, $(R_A(s,a), R_B(s,a))$, are shown in matrix form below each state $s = 0, 1$. [Figure not reproduced: states $s_0$ and $s_1$, each with a 2×2 reward matrix over actions $\{0, 1\}$ (every entry is $2, 0$ at $s_0$ and $0, 2$ at $s_1$); transitions are labelled "$a_A \oplus a_B = 0$" and "otherwise".]
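The informal statement of Theorem 1.1 can be illustrated numerically. The following is a minimal sketch and not the authors' implementation: it assumes a hypothetical two-agent, two-state identical-interest game (a special case of an MPG in which the common value can serve as the potential), exact gradients, direct parameterization with Euclidean projection onto the simplex, and an arbitrary step size `eta`. All names (`evaluate`, `pg_step`, `nash_gap`) and all numbers are illustrative assumptions.

```python
import numpy as np

GAMMA = 0.5                      # discount factor (hypothetical choice)
RHO = np.array([0.5, 0.5])       # initial-state distribution (assumed uniform)

# Toy identical-interest Markov game: both agents receive R[s, a1, a2];
# the game moves to state 1 iff both agents play action 1, else to state 0.
R = np.zeros((2, 2, 2))
R[0, 1, 1], R[1, 1, 1] = 0.5, 1.0
P = np.zeros((2, 2, 2, 2))       # P[s, a1, a2, s_next]
P[:, :, :, 0] = 1.0
P[:, 1, 1, :] = [0.0, 1.0]

def evaluate(pi1, pi2):
    """Exact values V[s], action values Q[s, a1, a2], discounted visitation d[s]."""
    joint = pi1[:, :, None] * pi2[:, None, :]          # joint[s, a1, a2]
    r_pi = (joint * R).sum(axis=(1, 2))
    P_pi = np.einsum("sab,sabt->st", joint, P)
    M = np.eye(2) - GAMMA * P_pi
    V = np.linalg.solve(M, r_pi)
    Q = R + GAMMA * (P @ V)
    d = (1 - GAMMA) * np.linalg.solve(M.T, RHO)
    return V, Q, d

def project_simplex(v):
    """Euclidean projection of a vector onto the probability simplex."""
    u = np.sort(v)[::-1]
    css = np.cumsum(u) - 1.0
    k = np.arange(1, len(v) + 1)
    idx = np.nonzero(u - css / k > 0)[0][-1]
    return np.maximum(v - css[idx] / (idx + 1), 0.0)

def pg_step(pi1, pi2, eta):
    """One simultaneous step: each agent ascends the gradient of its own value."""
    _, Q, d = evaluate(pi1, pi2)
    scale = d[:, None] / (1 - GAMMA)
    g1 = scale * np.einsum("sb,sab->sa", pi2, Q)       # average over agent 2's action
    g2 = scale * np.einsum("sa,sab->sb", pi1, Q)       # average over agent 1's action
    new1 = np.array([project_simplex(p + eta * g) for p, g in zip(pi1, g1)])
    new2 = np.array([project_simplex(p + eta * g) for p, g in zip(pi2, g2)])
    return new1, new2

def nash_gap(pi1, pi2, sweeps=200):
    """Largest per-state gain available to any single agent by deviating alone."""
    V, _, _ = evaluate(pi1, pi2)
    induced = (
        (np.einsum("sb,sab->sa", pi2, R), np.einsum("sb,sabt->sat", pi2, P)),
        (np.einsum("sa,sab->sb", pi1, R), np.einsum("sa,sabt->sbt", pi1, P)),
    )
    gaps = []
    for r_i, p_i in induced:
        v = np.zeros(2)
        for _ in range(sweeps):   # value iteration on the induced single-agent MDP
            v = (r_i + GAMMA * (p_i @ v)).max(axis=1)
        gaps.append((v - V).max())
    return max(gaps)

pi1 = np.array([[0.7, 0.3], [0.6, 0.4]])      # pi1[s, a], arbitrary interior start
pi2 = np.array([[0.5, 0.5], [0.65, 0.35]])
phi_start = RHO @ evaluate(pi1, pi2)[0]       # potential = common value in this toy game
gap_start = nash_gap(pi1, pi2)
for _ in range(2000):
    pi1, pi2 = pg_step(pi1, pi2, eta=0.05)
phi_end = RHO @ evaluate(pi1, pi2)[0]
gap_end = nash_gap(pi1, pi2)
print(f"potential {phi_start:.3f} -> {phi_end:.3f}, Nash gap {gap_start:.3f} -> {gap_end:.2e}")
```

In this toy game the joint action in which both agents play action 1 is rewarded at every state, so both policies should end at the deterministic policy that always plays action 1, where the Nash gap vanishes. The example says nothing about convergence rates; it only shows the mechanics of independent, simultaneous updates.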

[1]  Shane Legg,et al.  Human-level control through deep reinforcement learning , 2015, Nature.

[2]  Dmitriy Drusvyatskiy,et al.  Stochastic subgradient method converges at the rate O(k-1/4) on weakly convex functions , 2018, ArXiv.

[3]  Noam Brown,et al.  Superhuman AI for multiplayer poker , 2019, Science.

[4]  L. Shapley,et al.  Stochastic Games* , 1953, Proceedings of the National Academy of Sciences.

[5]  Tamer Basar,et al.  Multi-Agent Reinforcement Learning: A Selective Overview of Theories and Algorithms , 2019, Handbook of Reinforcement Learning and Control.

[6]  Noah Golowich,et al.  Independent Policy Gradient Methods for Competitive Reinforcement Learning , 2021, NeurIPS.

[7]  Georgios Piliouras,et al.  No-regret learning and mixed Nash equilibria: They do not mix , 2020, NeurIPS.

[8]  Ying Wen,et al.  Learning in Nonzero-Sum Stochastic Games with Potentials , 2021, ICML.

[9]  Éva Tardos,et al.  Multiplicative updates outperform generic no-regret learning in congestion games: extended abstract , 2009, STOC '09.

[10]  Santiago Zazo,et al.  Learning Parametric Closed-Loop Policies for Markov Potential Games , 2018, ICLR.

[11]  Alexander Shapiro,et al.  Stochastic Approximation approach to Stochastic Programming , 2013 .

[12]  Christos H. Papadimitriou,et al.  Worst-case Equilibria , 1999, STACS.

[13]  Bart De Schutter,et al.  A Comprehensive Survey of Multiagent Reinforcement Learning , 2008, IEEE Transactions on Systems, Man, and Cybernetics, Part C (Applications and Reviews).

[14]  S. Kakade,et al.  Optimality and Approximation with Policy Gradient Methods in Markov Decision Processes , 2019, COLT.

[15]  Tim Roughgarden,et al.  How bad is selfish routing? , 2000, Proceedings 41st Annual Symposium on Foundations of Computer Science.

[16]  L. Shapley,et al.  Potential Games , 1994 .

[17]  Ilai Bistritz,et al.  Cooperative Multi-player Bandit Optimization , 2020, NeurIPS.

[18]  Wojciech M. Czarnecki,et al.  Grandmaster level in StarCraft II using multi-agent reinforcement learning , 2019, Nature.

[19]  Christos H. Papadimitriou,et al.  Cycles in adversarial regularized learning , 2017, SODA.

[20]  David Mguni.  Stochastic Potential Games , 2020, ArXiv.

[21]  Haipeng Luo,et al.  Last-iterate Convergence of Decentralized Optimistic Gradient Descent/Ascent in Infinite-horizon Competitive Markov Games , 2021, COLT.

[22]  Gillian K. Hadfield,et al.  Cooperative AI: machines must learn to find common ground , 2021, Nature.

[23]  Jason R. Marden.  State based potential games , 2012, Autom.

[24]  John Langford,et al.  Approximately Optimal Approximate Reinforcement Learning , 2002, ICML.

[25]  Demis Hassabis,et al.  A general reinforcement learning algorithm that masters chess, shogi, and Go through self-play , 2018, Science.

[26]  Dimitri P. Bertsekas,et al.  Dynamic Programming and Optimal Control, Two Volume Set , 1995 .

[27]  Yuandong Tian,et al.  Provably Efficient Policy Gradient Methods for Two-Player Zero-Sum Markov Games , 2021, ArXiv.

[28]  Johanne Cohen,et al.  Learning with Bandit Feedback in Potential Games , 2017, NIPS.

[29]  Xiao Wang,et al.  Multiplicative Weights Updates as a distributed constrained optimization algorithm: Convergence to second-order stationary points almost always , 2018, ICML.

[30]  Haipeng Luo,et al.  Fast Convergence of Regularized Learning in Games , 2015, NIPS.

[31]  Michael I. Jordan,et al.  First-order methods almost always avoid strict saddle points , 2019, Mathematical Programming.

[33]  Xiaofeng Wang,et al.  Reinforcement Learning to Play an Optimal Nash Equilibrium in Team Markov Games , 2002, NIPS.

[34]  Richard S. Sutton,et al.  Reinforcement Learning: An Introduction , 1998, IEEE Trans. Neural Networks.

[35]  Refael Hassin,et al.  To Queue or Not to Queue: Equilibrium Behavior in Queueing Systems , 2002 .

[36]  Sébastien Bubeck,et al.  Convex Optimization: Algorithms and Complexity , 2014, Found. Trends Mach. Learn..

[37]  Joel Z. Leibo,et al.  Open Problems in Cooperative AI , 2020, ArXiv.

[38]  Demis Hassabis,et al.  Mastering the game of Go with deep neural networks and tree search , 2016, Nature.

[39]  Georgios Piliouras,et al.  Learning in Matrix Games can be Arbitrarily Complex , 2021, COLT.

[40]  Ruta Mehta,et al.  Natural Selection as an Inhibitor of Genetic Diversity: Multiplicative Weights Updates Algorithm and a Conjecture of Haploid Genetics [Working Paper Abstract] , 2014, ITCS.