Nash Q-Learning for General-Sum Stochastic Games

We extend Q-learning to a noncooperative multiagent context, using the framework of general-sum stochastic games. A learning agent maintains Q-functions over joint actions, and performs updates based on assuming Nash equilibrium behavior over the current Q-values. This learning protocol provably converges given certain restrictions on the stage games (defined by Q-values) that arise during learning. Experiments with a pair of two-player grid games suggest that such restrictions on the game structure are not necessarily required. Stage games encountered during learning in both grid environments violate the conditions. However, learning consistently converges in the first grid game, which has a unique equilibrium Q-function, but sometimes fails to converge in the second, which has three different equilibrium Q-functions. In a comparison of offline learning performance in both games, we find agents are more likely to reach a joint optimal path with Nash Q-learning than with a single-agent Q-learning method. When at least one agent adopts Nash Q-learning, the performance of both agents is better than using single-agent Q-learning. We have also implemented an online version of Nash Q-learning that balances exploration with exploitation, yielding improved performance.
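
The update underlying this protocol can be sketched as follows. For two agents, each agent i keeps a table Q^i(s, a^1, a^2) over joint actions and, after observing the rewards (r^1, r^2) and next state s', updates

    Q^i(s, a^1, a^2) <- (1 - alpha) Q^i(s, a^1, a^2) + alpha [ r^i + gamma * NashQ^i(s') ],

where NashQ^i(s') is agent i's payoff in a Nash equilibrium of the stage game defined by (Q^1(s', .), Q^2(s', .)). The Python fragment below is a minimal illustrative sketch, not the paper's implementation: for brevity it searches only for a pure-strategy equilibrium of the next stage game (in general-sum games the equilibrium may be mixed and is found with a bimatrix-game solver such as Lemke-Howson), and the fallback used when no pure equilibrium exists is an assumption made here purely to keep the sketch self-contained.

import numpy as np

def pure_nash(Q1, Q2):
    # Q1, Q2: payoff matrices of the stage game at one state,
    # shape (n_actions_1, n_actions_2).  Returns one pure-strategy Nash
    # equilibrium (a1, a2), or None if none exists.  (The paper's method
    # admits mixed equilibria; pure strategies keep this sketch short.)
    n1, n2 = Q1.shape
    for a1 in range(n1):
        for a2 in range(n2):
            if Q1[a1, a2] >= Q1[:, a2].max() and Q2[a1, a2] >= Q2[a1, :].max():
                return a1, a2
    return None

def nash_q_update(Q1, Q2, s, a1, a2, r1, r2, s_next, alpha=0.1, gamma=0.99):
    # Q1, Q2: arrays of shape (n_states, n_actions_1, n_actions_2), i.e. each
    # agent's Q-function over joint actions, as described in the abstract.
    eq = pure_nash(Q1[s_next], Q2[s_next])
    if eq is None:
        # Assumed fallback for this sketch only: use each agent's best joint
        # payoff when the next stage game has no pure-strategy equilibrium.
        nash1, nash2 = Q1[s_next].max(), Q2[s_next].max()
    else:
        nash1, nash2 = Q1[s_next][eq], Q2[s_next][eq]
    # Nash Q update: move toward r_i + gamma * (agent i's Nash payoff in the
    # stage game at the next state).
    Q1[s, a1, a2] += alpha * (r1 + gamma * nash1 - Q1[s, a1, a2])
    Q2[s, a1, a2] += alpha * (r2 + gamma * nash2 - Q2[s, a1, a2])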
