Individual Q-Learning in Normal Form Games

The single-agent multi-armed bandit problem can be solved by an agent that learns the values of each action using reinforcement learning. However, the multi-agent version of the problem, the iterated normal form game, presents a more complex challenge, since the rewards available to each agent depend on the strategies of the others. We consider the behavior of value-based learning agents in this situation, and show that such agents cannot generally play at a Nash equilibrium, although if smooth best responses are used, a Nash distribution can be reached. We introduce a particular value-based learning algorithm, which we call individual Q-learning, and use stochastic approximation to study its asymptotic behavior, showing that strategies converge to a Nash distribution almost surely in 2-player zero-sum games and 2-player partnership games. Player-dependent learning rates are then considered, and this extension is shown to converge in some games for which many algorithms, including the basic algorithm initially considered, fail to converge.
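To make the setting concrete, here is a minimal Python sketch of the kind of value-based learning with smoothed best responses that the abstract describes, applied to a 2-player zero-sum game (matching pennies). The Boltzmann response, the probability-normalised and capped update, and all parameter values are illustrative assumptions of this sketch rather than the paper's exact specification.

```python
import numpy as np

rng = np.random.default_rng(0)

# Payoff matrix for player 0 (row player) in matching pennies;
# player 1 (column player) receives the negated payoff.
A = np.array([[ 1.0, -1.0],
              [-1.0,  1.0]])

def smoothed_best_response(q, temperature=0.5):
    """Boltzmann (logit) choice probabilities over a player's own actions."""
    z = (q - q.max()) / temperature   # subtract max for numerical stability
    p = np.exp(z)
    return p / p.sum()

def individual_q_learning(n_steps=200_000):
    """Each player keeps Q-values only over its OWN actions, observes only
    its own reward, and updates the action it actually played."""
    q = [np.zeros(2), np.zeros(2)]
    for n in range(1, n_steps + 1):
        lam = n ** -0.6               # decaying step size; player-dependent
                                      # decay exponents could be used instead
        strategies = [smoothed_best_response(q[i]) for i in range(2)]
        actions = [rng.choice(2, p=strategies[i]) for i in range(2)]
        r0 = A[actions[0], actions[1]]
        rewards = [r0, -r0]
        for i in range(2):
            a = actions[i]
            # Normalising by the selection probability compensates for how
            # rarely an action is played; the cap at 1 is a safeguard used
            # only in this sketch.
            step = min(lam / strategies[i][a], 1.0)
            q[i][a] += step * (rewards[i] - q[i][a])
    return [smoothed_best_response(q[i]) for i in range(2)]

if __name__ == "__main__":
    # In matching pennies the unique equilibrium is the uniform mixture,
    # so both smoothed strategies should end up close to [0.5, 0.5].
    print(individual_q_learning())
```

A pure best response (taking the argmax instead of the Boltzmann mixture) would typically cycle rather than settle in this game, which illustrates the distinction the abstract draws between playing a Nash equilibrium and converging to a Nash distribution.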
