Utility based Q-learning to facilitate cooperation in Prisoner's Dilemma games

This work deals with Q-learning in a multiagent environment. There are many multiagent Q-learning methods, most of which aim to converge to a Nash equilibrium; such convergence is not desirable in games like the Prisoner's Dilemma (PD). However, ordinary Q-learning agents that choose actions stochastically to avoid local optima may achieve mutual cooperation in a PD game. Although such mutual cooperation usually occurs only in isolated rounds, it can be facilitated if the Q-function of cooperation becomes larger than that of defection after the cooperation. This work derives a theorem on how many consecutive repetitions of mutual cooperation are needed to make the Q-function of cooperation larger than that of defection. In addition, building on the author's previous works, which distinguish utilities from rewards and use utilities for learning in PD games, this work also derives a corollary on how much utility is necessary for a single instance of mutual cooperation to make the Q-function of cooperation larger than that of defection.

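To illustrate the kind of quantity the theorem characterises, the sketch below simulates a single stateless Q-learning agent in the PD and counts how many consecutive rounds of mutual cooperation are needed before the Q-function of cooperation exceeds that of defection. It is only a numerical illustration, not the paper's derivation: the payoff values (T=5, R=3, P=1, S=0), the learning rate, the discount factor, the stateless formulation, and the initial Q-values are all assumptions made for this example.

```python
# Illustrative sketch, not the paper's theorem or proof. The payoff values
# (T=5, R=3, P=1, S=0), learning rate, discount factor, stateless Q-learning
# formulation, and initial Q-values are all assumptions made for this example.

ALPHA = 0.1                                      # learning rate (assumed)
GAMMA = 0.9                                      # discount factor (assumed)
R_PAY, T_PAY, P_PAY, S_PAY = 3.0, 5.0, 1.0, 0.0  # standard PD payoffs (assumed)

def q_update(q, action, reward):
    # One stateless Q-learning step: Q(a) <- Q(a) + alpha*(r + gamma*max_a' Q(a') - Q(a))
    target = reward + GAMMA * max(q.values())
    q[action] += ALPHA * (target - q[action])

# Start from Q-values shaped by past mutual defection, so defection
# initially looks better than cooperation.
q = {"C": S_PAY, "D": P_PAY}

# Count consecutive rounds of mutual cooperation (the agent cooperates and
# receives the cooperation reward R) until Q(C) exceeds Q(D).
rounds = 0
while q["C"] <= q["D"]:
    q_update(q, "C", R_PAY)
    rounds += 1

print("Consecutive mutual cooperations until Q(C) > Q(D):", rounds)
```

With the parameters assumed above, more than one round of mutual cooperation is needed before Q(C) overtakes Q(D), which is exactly the gap the corollary addresses by asking how much utility a single mutual cooperation must carry.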