SA-IGA: a multiagent reinforcement learning method towards socially optimal outcomes

In multiagent environments, the capability to learn is important for an agent to behave appropriately in the face of unknown opponents and a dynamic environment. From the system designer's perspective, it is desirable for agents to learn to coordinate towards socially optimal outcomes while avoiding being exploited by selfish opponents. To this end, we propose a novel gradient-ascent-based algorithm (SA-IGA) that augments the basic gradient-ascent algorithm by incorporating social awareness into the policy update process. We theoretically analyze the learning dynamics of SA-IGA using dynamical system theory, and SA-IGA is shown to have linear dynamics for a wide range of games, including symmetric games. The learning dynamics of two representative games (the prisoner's dilemma game and the coordination game) are analyzed in detail. Based on the idea of SA-IGA, we further propose a practical multiagent learning algorithm, called SA-PGA, built on the Q-learning update rule. Simulation results show that the SA-PGA agent achieves higher social welfare than the previous social-optimality-oriented Conditional Joint Action Learner (CJAL) and is also robust against individually rational opponents, reaching Nash equilibrium solutions.
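To make the core idea concrete, the sketch below shows what a socially-aware gradient-ascent step might look like in a two-player, two-action matrix game. It is a minimal illustration, not the paper's exact formulation: the blended objective (1 - w_i) * u_i + w_i * u_social, the fixed social-attitude weights w1 and w2, the specific prisoner's dilemma payoffs, and the forward-difference gradient are all assumptions made for this example. In the full SA-IGA algorithm the social attitude itself is also updated during learning.

```python
import numpy as np

# Illustrative sketch of a socially-aware gradient update (assumptions
# labeled below; the exact SA-IGA update rules are defined in the paper).
# Two players, two actions; player i plays action 0 with probability p_i.
# Assumed prisoner's dilemma payoffs (action 0 = cooperate, 1 = defect).
R1 = np.array([[3.0, 0.0],
               [5.0, 1.0]])  # row player's payoff matrix
R2 = R1.T                    # symmetric game: column player's payoffs

def expected_payoffs(p1, p2):
    """Expected payoffs to each player under mixed strategies (p1, p2)."""
    s1 = np.array([p1, 1.0 - p1])
    s2 = np.array([p2, 1.0 - p2])
    return s1 @ R1 @ s2, s1 @ R2 @ s2

def sa_step(p1, p2, w1, w2, eta=0.01, delta=1e-5):
    """One socially-aware gradient-ascent step. Each agent i ascends the
    blended objective (1 - w_i) * u_i + w_i * u_social, where u_social is
    the average payoff. The blend and the numerical (forward-difference)
    gradient are assumptions for illustration; in the paper's algorithm
    the attitude w_i also adapts over time (omitted here)."""
    u1, u2 = expected_payoffs(p1, p2)
    social = 0.5 * (u1 + u2)
    # Player 1: numerical gradient of the blended objective w.r.t. p1.
    v1 = (1.0 - w1) * u1 + w1 * social
    u1d, u2d = expected_payoffs(p1 + delta, p2)
    g1 = ((1.0 - w1) * u1d + w1 * 0.5 * (u1d + u2d) - v1) / delta
    # Player 2: symmetric step w.r.t. p2.
    v2 = (1.0 - w2) * u2 + w2 * social
    u1e, u2e = expected_payoffs(p1, p2 + delta)
    g2 = ((1.0 - w2) * u2e + w2 * 0.5 * (u1e + u2e) - v2) / delta
    # Gradient ascent, projected back onto the valid probability range.
    p1 = float(np.clip(p1 + eta * g1, 0.0, 1.0))
    p2 = float(np.clip(p2 + eta * g2, 0.0, 1.0))
    return p1, p2
```

Under these assumed payoffs, iterating sa_step with w1 = w2 = 0 recovers plain gradient-ascent behavior and drives both players toward mutual defection, whereas fully prosocial attitudes (w1 = w2 = 1) make both ascend the social welfare and converge to mutual cooperation, which is the kind of attitude-dependent shift in the dynamics that the paper analyzes.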
