Multiagent learning in the presence of agents with limitations

Learning to act in a multiagent environment is a challenging problem. Optimal behavior for one agent depends upon the behavior of the other agents, which are learning as well. Multiagent environments are therefore non-stationary, violating the stationarity assumption that underlies traditional single-agent learning. In addition, agents in complex tasks may have limitations, such as physical constraints or designer-imposed approximations of the task that make learning tractable. Limitations prevent agents from acting optimally, which further complicates an already challenging problem. A learning agent must effectively compensate for its own limitations while exploiting the limitations of the other agents.

My thesis research focuses on these two challenges, namely multiagent learning and limitations, and includes four main contributions. First, the thesis introduces the novel concepts of a variable learning rate and the WoLF (Win or Learn Fast) principle to account for other learning agents. The WoLF principle can make rational learning algorithms converge to optimal policies, and in doing so achieves two properties, rationality and convergence, that no previous technique had achieved simultaneously. The converging effect of WoLF is proven for a class of matrix games and demonstrated empirically for a wide range of stochastic games. Second, the thesis contributes an analysis of the effect of limitations on the game-theoretic concept of Nash equilibria. The existence of equilibria is important if multiagent learning techniques, which often depend on the concept, are to be applied to realistic problems where limitations are unavoidable. The thesis introduces a general model of how limitations affect agent behavior, and uses this model to analyze the resulting impact on equilibria. The thesis shows that equilibria do exist for a few restricted classes of games and limitations, but that, in general, even well-behaved limitations do not preserve the existence of equilibria. Third, the thesis introduces GraWoLF, a general-purpose, scalable, multiagent learning algorithm that combines policy gradient learning techniques with the WoLF variable learning rate. The effectiveness of the algorithm is demonstrated both in a card game with an intractably large state space and in an adversarial robot task; both tasks are complex, and agent limitations are prevalent in each. Fourth, the thesis describes the CMDragons robot soccer team strategy for adapting to an unknown opponent. (Abstract shortened by UMI.)
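To make the WoLF principle concrete, here is one way to write it in gradient-ascent form for a two-player, two-action matrix game, where player 1 plays its first action with probability \(\alpha\) and player 2 with probability \(\beta\), \(V_1\) is player 1's expected payoff, \(\alpha^e\) is an equilibrium strategy for player 1, \(\eta\) is a base step size, and \(\ell_{\min} < \ell_{\max}\) are the two learning rates. The notation is an illustrative sketch, not quoted from the thesis:

```latex
% WoLF variable learning rate applied to gradient ascent (a sketch):
% learn slowly when winning, quickly when losing.
\[
\alpha_{k+1} \;=\; \alpha_k \;+\; \eta\,\ell_k\,
  \frac{\partial V_1(\alpha_k,\beta_k)}{\partial \alpha},
\qquad
\ell_k \;=\;
\begin{cases}
\ell_{\min} & \text{if } V_1(\alpha_k,\beta_k) > V_1(\alpha^e,\beta_k)
  \quad \text{(winning: learn slowly)}\\[2pt]
\ell_{\max} & \text{otherwise}
  \quad \text{(losing: learn fast)}
\end{cases}
\]
```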
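The same idea drives WoLF-style policy hill-climbing. Below is a minimal, self-contained Python sketch of such a learner in a repeated matrix game; the matching-pennies payoffs, the fixed biased opponent, and all step sizes are assumptions chosen for illustration, not the thesis's experimental setup:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical matching-pennies payoffs for the row player.
payoff = np.array([[+1.0, -1.0],
                   [-1.0, +1.0]])
opponent = np.array([0.7, 0.3])   # stand-in opponent: fixed and biased

n_actions = 2
alpha = 0.1          # Q-value learning rate
delta_win = 0.01     # small policy step while winning
delta_lose = 0.04    # larger policy step while losing

Q = np.zeros(n_actions)
pi = np.full(n_actions, 1.0 / n_actions)   # current mixed policy
pi_avg = pi.copy()                         # running average policy
count = 0

def renormalize(p):
    """Clip to [0, 1] and rescale so p remains a distribution
    (adequate for this 2-action sketch, not a general simplex projection)."""
    p = np.clip(p, 0.0, 1.0)
    return p / p.sum()

for t in range(20000):
    a = rng.choice(n_actions, p=pi)
    b = rng.choice(n_actions, p=opponent)
    r = payoff[a, b]

    # Q-learning update for a repeated (single-state) game.
    Q[a] += alpha * (r - Q[a])

    # Incrementally track the average policy.
    count += 1
    pi_avg += (pi - pi_avg) / count

    # WoLF test: "winning" if the current policy's expected value
    # beats the average policy's; learn slowly if so, fast if not.
    delta = delta_win if pi @ Q > pi_avg @ Q else delta_lose

    # Hill-climb: shift delta probability mass toward the greedy action.
    step = np.full(n_actions, -delta / (n_actions - 1))
    step[np.argmax(Q)] = delta
    pi = renormalize(pi + step)

print("learned policy:", pi)   # concentrates on the best response
```

The key design choice is the asymmetry delta_lose > delta_win: the agent adapts cautiously while winning, giving other learning agents time to adjust, and quickly while losing, which is what lets rational gradient learners converge rather than cycle.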

[1]  Brett Browning,et al.  ÜberSim: a multi-robot simulator for robot soccer , 2003, AAMAS '03.

[2]  Andrew W. Moore,et al.  Prioritized sweeping: Reinforcement learning with less data and less time , 2004, Machine Learning.

[3]  Hervé Reinhard,et al.  Differential equations: Foundations and applications , 1986 .

[4]  Sandip Sen,et al.  Learning to Coordinate without Sharing Information , 1994, AAAI.

[5]  O. Mangasarian,et al.  Two-person nonzero-sum games and quadratic programming , 1964 .

[6]  Andrew W. Moore,et al.  Gradient Descent for General Reinforcement Learning , 1998, NIPS.

[7]  R. Karp,et al.  On Nonterminating Stochastic Games , 1966 .

[8]  Michael I. Jordan,et al.  Reinforcement Learning Algorithm for Partially Observable Markov Decision Problems , 1994, NIPS.

[9]  O. J. Vrieze,et al.  Stochastic Games with Finite State and Action Spaces. , 1988 .

[10]  J. Goodman Note on Existence and Uniqueness of Equilibrium Points for Concave N-Person Games , 1965 .

[11]  Ronald A. Howard,et al.  Dynamic Programming and Markov Processes , 1960 .

[12]  David Carmel,et al.  Learning Models of Intelligent Agents , 1996, AAAI/IAAI, Vol. 1.

[13]  Csaba Szepesvári,et al.  A Generalized Reinforcement-Learning Model: Convergence and Applications , 1996, ICML.

[14]  M. Veloso,et al.  Bounding the suboptimality of reusing subproblems , 1999, IJCAI 1999.

[15]  E. Kalai,et al.  Rational Learning Leads to Nash Equilibrium , 1993 .

[16]  Yishay Mansour,et al.  Nash Convergence of Gradient Dynamics in General-Sum Games , 2000, UAI.

[17]  Shie Mannor,et al.  Adaptive Strategies and Regret Minimization in Arbitrarily Varying Markov Environments , 2001, COLT/EuroCOLT.

[18]  M. F.,et al.  Bibliography , 1985, Experimental Gerontology.

[19]  Martin Zinkevich,et al.  Online Convex Programming and Generalized Infinitesimal Gradient Ascent , 2003, ICML.

[20]  Andrew G. Barto,et al.  Automatic Discovery of Subgoals in Reinforcement Learning using Diverse Density , 2001, ICML.

[21]  Manuela M. Veloso,et al.  Existence of Multiagent Equilibria with Limited Agents , 2004, J. Artif. Intell. Res..

[22]  Andrew G. Barto,et al.  Reinforcement learning , 1998 .

[23]  Geoffrey J. Gordon Reinforcement Learning with Function Approximation Converges to a Region , 2000, NIPS.

[24]  Tommi S. Jaakkola,et al.  Convergence Results for Single-Step On-Policy Reinforcement-Learning Algorithms , 2000, Machine Learning.

[25]  S. Hart,et al.  Uncoupled Dynamics Do Not Lead to Nash Equilibrium , 2003 .

[26]  Xiaofeng Wang,et al.  Reinforcement Learning to Play an Optimal Nash Equilibrium in Team Markov Games , 2002, NIPS.

[27]  Robert E. Tarjan,et al.  Self-adjusting binary search trees , 1985, JACM.

[28]  Tuomas Sandholm,et al.  Bargaining with limited computation: Deliberation equilibrium , 2001, Artif. Intell..

[29]  T. Speed,et al.  Interview of Albert Tucker , 1975 .

[30]  Chris Watkins,et al.  Learning from delayed rewards , 1989 .

[31]  Manuela M. Veloso,et al.  Real-time randomized path planning for robot navigation , 2002, IEEE/RSJ International Conference on Intelligent Robots and Systems.

[32]  Keith B. Hall,et al.  Correlated Q-Learning , 2003, ICML.

[33]  S. Ross GOOFSPIEL -- THE GAME OF PURE STRATEGY , 1971 .

[34]  S. Hart,et al.  Uncoupled Dynamics Cannot Lead to Nash Equilibrium ∗ , 2002 .

[35]  Andrew W. Moore,et al.  Reinforcement Learning: A Survey , 1996, J. Artif. Intell. Res..

[36]  T. Cormen,et al.  Model-based Learning of Interaction Strategies in Multi-agent Systems , 1997 .

[37]  Michael P. Wellman,et al.  Learning in dynamic noncooperative multiagent systems , 1999 .

[38]  Jörgen W. Weibull,et al.  Evolutionary Game Theory , 1996 .

[39]  Dov Samet,et al.  Learning to play games in extensive form by valuation , 2001, J. Econ. Theory.

[40]  Michael L. Littman,et al.  Friend-or-Foe Q-learning in General-Sum Games , 2001, ICML.

[41]  J. Albus A Theory of Cerebellar Function , 1971 .

[42]  Yishay Mansour,et al.  Policy Gradient Methods for Reinforcement Learning with Function Approximation , 1999, NIPS.

[43]  Peter L. Bartlett,et al.  Reinforcement Learning in POMDP's via Direct Gradient Ascent , 2000, ICML.

[44]  R. McKelvey,et al.  Computation of equilibria in finite games , 1996 .

[45]  Hiroaki Kitano,et al.  RoboCup: A Challenge Problem for AI , 1997, AI Mag..

[46]  Craig Boutilier,et al.  The Dynamics of Reinforcement Learning in Cooperative Multiagent Systems , 1998, AAAI/IAAI.

[47]  J. Wal Discounted Markov games; successive approximation and stopping times , 1977 .

[48]  Doina Precup,et al.  Intra-Option Learning about Temporally Abstract Actions , 1998, ICML.

[49]  M. Pollatschek,et al.  Algorithms for Stochastic Games with Geometrical Interpretation , 1969 .

[50]  E. Rowland Theory of Games and Economic Behavior , 1946, Nature.

[51]  Vincent Conitzer,et al.  Complexity Results about Nash Equilibria , 2002, IJCAI.

[52]  J. Filar,et al.  Competitive Markov Decision Processes , 1996 .

[53]  Gunes Ercal,et al.  On No-Regret Learning, Fictitious Play, and Nash Equilibrium , 2001, ICML.

[54]  J. Robinson AN ITERATIVE METHOD OF SOLVING A GAME , 1951, Classics in Game Theory.

[55]  Manuela M. Veloso,et al.  Planning for Distributed Execution through Use of Probabilistic Opponent Models , 2002, AIPS.

[56]  J. Nash Equilibrium Points in N-Person Games. , 1950, Proceedings of the National Academy of Sciences of the United States of America.

[57]  Bikramjit Banerjee,et al.  Convergent Gradient Ascent in General-Sum Games , 2002, ECML.

[58]  Nicolò Cesa-Bianchi,et al.  Gambling in a rigged casino: The adversarial multi-armed bandit problem , 1995, Proceedings of IEEE 36th Annual Foundations of Computer Science.

[59]  L. C. Thomas,et al.  Stochastic Games with Finite State and Action Spaces , 1988 .

[60]  Milos Hauskrecht,et al.  Hierarchical Solution of Markov Decision Processes using Macro-actions , 1998, UAI.

[61]  Michael L. Littman,et al.  Markov Games as a Framework for Multi-Agent Reinforcement Learning , 1994, ICML.

[62]  Ariel Rubinstein,et al.  A Course in Game Theory , 1995 .

[63]  Manuela Veloso,et al.  Scalable Learning in Stochastic Games , 2002 .

[64]  Maja J. Mataric,et al.  Reward Functions for Accelerated Learning , 1994, ICML.

[65]  Manuela M. Veloso,et al.  Convergence of Gradient Dynamics with a Variable Learning Rate , 2001, ICML.

[66]  Manuela M. Veloso,et al.  Multiagent learning using a variable learning rate , 2002, Artif. Intell..

[67]  Ronen I. Brafman,et al.  Efficient learning equilibrium , 2004, Artificial Intelligence.

[68]  Ronen I. Brafman,et al.  R-MAX - A General Polynomial Time Algorithm for Near-Optimal Reinforcement Learning , 2001, J. Mach. Learn. Res..

[69]  Shlomo Zilberstein,et al.  Models of Bounded Rationality , 1995 .

[70]  Peter Stone,et al.  Scaling Reinforcement Learning toward RoboCup Soccer , 2001, ICML.

[71]  Manuela M. Veloso,et al.  On Behavior Classification in Adversarial Environments , 2000, DARS.

[72]  Itzhak Gilboa,et al.  Bounded Versus Unbounded Rationality: The Tyranny of the Weak , 1989 .

[73]  Michael P. Wellman,et al.  Multiagent Reinforcement Learning: Theoretical Framework and an Algorithm , 1998, ICML.

[74]  L. Shapley,et al.  Stochastic Games* , 1953, Proceedings of the National Academy of Sciences.

[75]  H. Kuhn Classics in Game Theory , 1997 .

[76]  Michael H. Bowling,et al.  Convergence Problems of General-Sum Multiagent Reinforcement Learning , 2000, ICML.

[77]  Jing Peng,et al.  Incremental multi-step Q-learning , 1994, Machine Learning.

[78]  Craig Boutilier,et al.  Planning, Learning and Coordination in Multiagent Decision Processes , 1996, TARK.

[79]  Mahesan Niranjan,et al.  On-line Q-learning using connectionist systems , 1994 .

[80]  Sean R Eddy,et al.  What is dynamic programming? , 2004, Nature Biotechnology.

[81]  G. Brown SOME NOTES ON COMPUTATION OF GAMES SOLUTIONS , 1949 .

[82]  Avrim Blum,et al.  On-line Learning and the Metrical Task System Problem , 1997, COLT '97.

[83]  D. Fudenberg,et al.  The Theory of Learning in Games , 1998 .

[84]  Ian Frank,et al.  Soccer Server: A Tool for Research on Multiagent Systems , 1998, Appl. Artif. Intell..

[85]  Stuart J. Russell Rationality and Intelligence , 1995, IJCAI.

[86]  Manfred K. Warmuth,et al.  The Weighted Majority Algorithm , 1994, Inf. Comput..

[87]  Manuela Veloso,et al.  Tree based hierarchical reinforcement learning , 2002 .

[88]  William T. B. Uther,et al.  Adversarial Reinforcement Learning , 2003 .

[89]  Peter Stone,et al.  Leading Best-Response Strategies in Repeated Games , 2001, International Joint Conference on Artificial Intelligence.

[90]  A. Rubinstein Modeling Bounded Rationality , 1998 .

[91]  Ronald J. Williams,et al.  Tight Performance Bounds on Greedy Policies Based on Imperfect Value Functions , 1993 .

[92]  Eitan Zemel,et al.  Nash and correlated equilibria: Some complexity considerations , 1989 .

[93]  Brett Browning,et al.  Improbability filtering for rejecting false positives , 2002, Proceedings 2002 IEEE International Conference on Robotics and Automation (Cat. No.02CH37292).

[94]  Robert H. Crites,et al.  Multiagent reinforcement learning in the Iterated Prisoner's Dilemma. , 1996, Bio Systems.

[95]  Peter J. Jansen,et al.  Using knowledge about the opponent in game-tree search , 1992 .