AWESOME: A general multiagent learning algorithm that converges in self-play and learns a best response against stationary opponents

Two minimal requirements for a satisfactory multiagent learning algorithm are that it (1) learns to play optimally against stationary opponents and (2) converges to a Nash equilibrium in self-play. The previous algorithm that came closest, WoLF-IGA, has been proven to have these two properties in 2-player 2-action (repeated) games, assuming that the opponent's mixed strategy is observable. Another algorithm, ReDVaLeR (which was introduced after the algorithm described in this paper), achieves the two properties in games with arbitrary numbers of actions and players, but still requires that the opponents' mixed strategies be observable. In this paper we present AWESOME, the first algorithm that is guaranteed to have the two properties in games with arbitrary numbers of actions and players. It is still the only algorithm that does so while relying only on observing the other players' actual actions (not their mixed strategies). It also learns to play optimally against opponents that eventually become stationary. The basic idea behind AWESOME (Adapt When Everybody is Stationary, Otherwise Move to Equilibrium) is to try to adapt to the other players' strategies when they appear stationary, but otherwise to retreat to a precomputed equilibrium strategy. We provide experimental results suggesting that AWESOME converges quickly in practice. The techniques used to prove the properties of AWESOME are fundamentally different from those used for previous algorithms, and may help in analyzing future multiagent learning algorithms as well.
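To make the control loop described above concrete, the sketch below illustrates the "adapt when stationary, otherwise retreat" idea in Python. It is a minimal sketch, not the paper's actual pseudocode: the callables `precomputed_equilibrium`, `equilibrium_profile`, `best_response`, and `play_round` are an assumed interface supplied by the caller, and the fixed `epsilon` and `epoch_length` stand in for the schedule of shrinking hypothesis-test thresholds and growing epochs that the paper's convergence proof actually relies on.

```python
from collections import Counter

# Illustrative sketch of AWESOME's control loop (hypothetical interface):
#   precomputed_equilibrium : this player's strategy in the Nash equilibrium
#                             that all players have precomputed
#   equilibrium_profile     : the other players' equilibrium mixed strategies,
#                             as a list of {action: probability} dicts
#   best_response(profile)  : a best response to an estimated opponent profile
#   play_round(strategy)    : plays one round with `strategy` and returns the
#                             tuple of actions the other players actually took

def empirical_profile(observations):
    """Per-opponent empirical action frequencies from a list of
    observed opponent-action tuples (one tuple per round)."""
    profile = []
    for i in range(len(observations[0])):
        counts = Counter(obs[i] for obs in observations)
        total = sum(counts.values())
        profile.append({a: c / total for a, c in counts.items()})
    return profile

def deviates(profile_a, profile_b, epsilon):
    """True if some opponent's action frequencies differ by more than epsilon."""
    for pa, pb in zip(profile_a, profile_b):
        if any(abs(pa.get(a, 0.0) - pb.get(a, 0.0)) > epsilon
               for a in set(pa) | set(pb)):
            return True
    return False

def awesome(precomputed_equilibrium, equilibrium_profile, best_response,
            play_round, epsilon=0.05, epoch_length=100):
    """Runs forever (the game is infinitely repeated); each outer
    iteration is one restart back to the precomputed equilibrium."""
    while True:
        strategy = precomputed_equilibrium   # retreat to the equilibrium
        assume_equilibrium = True            # hypothesis 1: everyone plays it
        history = []
        while True:
            epoch = [play_round(strategy) for _ in range(epoch_length)]
            history.extend(epoch)
            recent = empirical_profile(epoch)
            if assume_equilibrium and deviates(recent, equilibrium_profile, epsilon):
                # The others do not appear to be playing the equilibrium:
                # switch to hypothesis 2 (they are stationary) and adapt.
                assume_equilibrium = False
                strategy = best_response(recent)
            elif not assume_equilibrium:
                if deviates(recent, empirical_profile(history), epsilon):
                    break                    # not stationary either: restart
                strategy = best_response(recent)
```

In the actual algorithm, both the deviation threshold and the epoch length change across restarts according to a carefully chosen schedule; it is that schedule, rather than the fixed constants used in this sketch, that underpins the convergence and best-response guarantees.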
