Collaborative Multiagent Reinforcement Learning by Payoff Propagation

In this article we describe a set of scalable techniques for learning the behavior of a group of agents in a collaborative multiagent setting. As a basis we use the framework of coordination graphs of Guestrin, Koller, and Parr (2002a), which exploits the dependencies between agents to decompose the global payoff function into a sum of local terms. First, we deal with the single-state case and describe a payoff propagation algorithm that computes the individual actions that approximately maximize the global payoff function. The method can be viewed as the decision-making analogue of belief propagation in Bayesian networks. Second, we focus on learning the behavior of the agents in sequential decision-making tasks. We introduce different model-free reinforcement-learning techniques, collectively called Sparse Cooperative Q-learning, which approximate the global action-value function based on the topology of a coordination graph and perform updates using the contribution of the individual agents to the maximal global action value. The combined use of an edge-based decomposition of the action-value function and the payoff propagation algorithm for efficient action selection results in an approach that scales only linearly in the problem size. We provide experimental evidence that our method outperforms related multiagent reinforcement-learning methods based on temporal differences.
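To make the payoff propagation idea concrete, the following is a minimal sketch of max-plus message passing on a coordination graph. The chain-shaped graph, the two edge-payoff tables, and the binary action sets are all hypothetical illustrations, not data from the article; on a tree-structured graph such as this chain, the algorithm recovers the exact joint action maximizing the global payoff.

```python
# Chain-structured coordination graph: agents 0 - 1 - 2, binary actions.
# Edge payoff tables f[(i, j)][(a_i, a_j)] (hypothetical numbers for illustration).
f = {
    (0, 1): {(0, 0): 2.0, (0, 1): 0.0, (1, 0): 1.0, (1, 1): 3.0},
    (1, 2): {(0, 0): 1.0, (0, 1): 4.0, (1, 0): 0.0, (1, 1): 2.0},
}
actions = [0, 1]
neighbors = {0: [1], 1: [0, 2], 2: [1]}

def edge_payoff(i, j, ai, aj):
    """Look up f_ij regardless of which orientation the edge is stored in."""
    return f[(i, j)][(ai, aj)] if (i, j) in f else f[(j, i)][(aj, ai)]

def max_plus(iterations=10):
    # mu[(i, j)][a_j]: message sent from agent i to neighbor j, initially zero.
    mu = {(i, j): {a: 0.0 for a in actions}
          for i in neighbors for j in neighbors[i]}
    for _ in range(iterations):
        for (i, j) in list(mu):
            for aj in actions:
                # Maximize over own action: local edge payoff plus all
                # incoming messages except the one coming from j itself.
                mu[(i, j)][aj] = max(
                    edge_payoff(i, j, ai, aj)
                    + sum(mu[(k, i)][ai] for k in neighbors[i] if k != j)
                    for ai in actions)
    # Each agent individually maximizes the sum of its incoming messages.
    return {i: max(actions,
                   key=lambda ai: sum(mu[(k, i)][ai] for k in neighbors[i]))
            for i in neighbors}

joint = max_plus()
value = sum(edge_payoff(i, j, joint[i], joint[j]) for (i, j) in f)
```

On graphs with cycles the messages are no longer guaranteed to converge, which is why the article pairs max-plus with an anytime mechanism: the best joint action found so far is kept while message passing continues.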

[1] L. Shapley. Stochastic Games, 1953, Proceedings of the National Academy of Sciences.

[2] Umberto Bertelè, et al. Nonserial Dynamic Programming, 1972.

[3] Judea Pearl. Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference, 1988, Morgan Kaufmann.

[4] Makoto Yokoo, et al. Distributed Constraint Optimization as a Formal Model of Partially Adversarial Cooperation, 1991.

[5] Michael L. Littman, et al. Packet Routing in Dynamically Changing Networks: A Reinforcement Learning Approach, 1993, NIPS.

[6] Martin L. Puterman. Markov Decision Processes: Discrete Stochastic Dynamic Programming, 1994.

[7] Sandip Sen, et al. Learning to Coordinate without Sharing Information, 1994, AAAI.

[8] Gerald Tesauro. Temporal Difference Learning and TD-Gammon, 1995, Communications of the ACM.

[9] Andrew G. Barto, et al. Improving Elevator Performance Using Reinforcement Learning, 1995, NIPS.

[10] Nevin Lianwen Zhang, et al. Exploiting Causal Independence in Bayesian Network Inference, 1996, J. Artif. Intell. Res.

[11] John N. Tsitsiklis, et al. Neuro-Dynamic Programming, 1996.

[12] Craig Boutilier. Planning, Learning and Coordination in Multiagent Decision Processes, 1996, TARK.

[13] Hiroaki Kitano, et al. RoboCup: The Robot World Cup Initiative, 1997, AGENTS '97.

[14] Rina Dechter, et al. A Scheme for Approximating Probabilistic Inference, 1997, UAI.

[15] Craig Boutilier, et al. The Dynamics of Reinforcement Learning in Cooperative Multiagent Systems, 1998, AAAI/IAAI.

[16] Richard S. Sutton, et al. Reinforcement Learning: An Introduction, 1998, MIT Press.

[17] Michael I. Jordan, et al. Loopy Belief Propagation for Approximate Inference: An Empirical Study, 1999, UAI.

[18] Andrew W. Moore, et al. Distributed Value Functions, 1999, ICML.

[19] Gerhard Weiss. Multiagent Systems, 1999.

[20] Neil Immerman, et al. The Complexity of Decentralized Control of Markov Decision Processes, 2000, UAI.

[21] Kee-Eung Kim, et al. Learning to Cooperate via Policy Search, 2000, UAI.

[22] Wolfram Burgard, et al. Collaborative Multi-Robot Exploration, 2000, ICRA.

[23] Brendan J. Frey, et al. Factor Graphs and the Sum-Product Algorithm, 2001, IEEE Trans. Inf. Theory.

[24] Edmund H. Durfee, et al. Scaling Up Agent Coordination Strategies, 2001, Computer.

[25] Julie A. Adams, et al. Multiagent Systems: A Modern Approach to Distributed Artificial Intelligence, 2001, AI Mag.

[26] Carlos Guestrin, et al. Multiagent Planning with Factored MDPs, 2001, NIPS.

[27] Shobha Venkataraman, et al. Context-Specific Multiagent Coordination and Planning with Factored MDPs, 2002, AAAI/IAAI.

[28] Lynne E. Parker, et al. Editorial: Advances in Multi-Robot Systems, 2002, IEEE Trans. Robotics Autom.

[29] Michail G. Lagoudakis, et al. Coordinated Reinforcement Learning, 2002, ICML.

[30] Lynne E. Parker, et al. Distributed Algorithms for Multi-Robot Observation of Multiple Moving Targets, 2002, Auton. Robots.

[31] Milind Tambe, et al. The Communicative Multiagent Team Decision Problem: Analyzing Teamwork Theories and Models, 2002, J. Artif. Intell. Res.

[32] William T. Freeman, et al. Understanding Belief Propagation and Its Generalizations, 2003.

[33] Nikos Vlassis. A Concise Introduction to Multiagent Systems and Distributed AI, 2003.

[34] S. Shankar Sastry, et al. Autonomous Helicopter Flight via Reinforcement Learning, 2003, NIPS.

[35] Avi Pfeffer, et al. Loopy Belief Propagation as a Basis for Communication in Sensor Networks, 2002, UAI.

[36] Claudia V. Goldman, et al. Optimizing Information Exchange in Cooperative Multi-Agent Systems, 2003, AAMAS.

[37] Milind Tambe, et al. Distributed Sensor Networks: A Multiagent Perspective, 2003.

[38] D. Koller, et al. Planning under Uncertainty in Complex Structured Environments, 2003.

[39] Benjamin Van Roy, et al. Distributed Optimization in Adaptive Networks, 2003, NIPS.

[40] Craig Boutilier, et al. Coordination in Multiagent Reinforcement Learning: A Bayesian Approach, 2003, AAMAS.

[41] Claudia V. Goldman, et al. Transition-Independent Decentralized Markov Decision Processes, 2003, AAMAS.

[42] Martin J. Wainwright, et al. Tree Consistency and Bounds on the Performance of the Max-Product Algorithm and Its Generalizations, 2004, Stat. Comput.

[43] Peter Dayan, et al. Q-Learning, 1992, Machine Learning.

[44] Nikos A. Vlassis, et al. Sparse Cooperative Q-Learning, 2004, ICML.

[45] Ben Tse, et al. Autonomous Inverted Helicopter Flight via Reinforcement Learning, 2004, ISER.

[46] Nikos A. Vlassis, et al. Anytime Algorithms for Multiagent Decision Making Using Coordination Graphs, 2004, IEEE SMC.

[47] Shlomo Zilberstein, et al. Dynamic Programming for Partially Observable Stochastic Games, 2004, AAAI.

[48] Claudia V. Goldman, et al. Decentralized Control of Cooperative Systems: Categorization and Complexity Analysis, 2004, J. Artif. Intell. Res.

[49] H.-A. Loeliger. An Introduction to Factor Graphs, 2004, IEEE Signal Process. Mag.

[50] Nicholas R. Jennings, et al. Cooperative Information Sharing to Improve Distributed Learning in Multi-Agent Systems, 2005, J. Artif. Intell. Res.

[51] Peter Stone, et al. Reinforcement Learning for RoboCup Soccer Keepaway, 2005, Adapt. Behav.

[52] Milind Tambe, et al. Preprocessing Techniques for Accelerating the DCOP Algorithm ADOPT, 2005, AAMAS.

[53] Nikos A. Vlassis, et al. Non-Communicative Multi-Robot Coordination in Dynamic Environments, 2005, Robotics Auton. Syst.

[54] Nikos A. Vlassis, et al. Using the Max-Plus Algorithm for Multiagent Decision Making in Coordination Graphs, 2005, BNAIC.

[55] Makoto Yokoo, et al. Adopt: Asynchronous Distributed Constraint Optimization with Quality Guarantees, 2005, Artif. Intell.

[56] J. R. Kok, et al. Cooperation and Learning in Cooperative Multiagent Systems, 2006.