Value-Decomposition Networks For Cooperative Multi-Agent Learning Based On Team Reward

We study the problem of cooperative multi-agent reinforcement learning with a single joint reward signal. This class of learning problems is difficult because of the often large combined action and observation spaces. In both the fully centralized and the fully decentralized approaches, we find the problem of spurious rewards and a phenomenon we call the "lazy agent" problem, which arises due to partial observability. We address these problems by training individual agents with a novel value-decomposition network architecture, which learns to decompose the team value function into agent-wise value functions. We perform an experimental evaluation across a range of partially observable multi-agent domains and show that learning such value decompositions leads to superior results, in particular when combined with weight sharing, role information and information channels.
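The additive decomposition underlying this architecture can be illustrated with a small sketch. The following is a minimal tabular stand-in (hypothetical toy dimensions, NumPy only; not the paper's recurrent network): each agent i keeps its own value component Q_i over its local observation, the team value is the sum Q_tot = sum_i Q_i, and only Q_tot is trained against the single team reward, so the shared TD error flows back into every agent-wise component. Because the sum is maximised by each agent greedily maximising its own Q_i, execution remains fully decentralised.

```python
import numpy as np

# Toy dimensions (assumptions for illustration only).
n_agents, n_obs, n_actions = 2, 4, 3

# One Q-table per agent over its *local* observation
# (a tabular stand-in for the per-agent value network).
q = [np.zeros((n_obs, n_actions)) for _ in range(n_agents)]

def q_tot(obs, acts):
    """Team value as the sum of agent-wise value components."""
    return sum(q[i][obs[i], acts[i]] for i in range(n_agents))

def greedy_joint_action(obs):
    """Additivity lets each agent maximise its own Q_i independently."""
    return [int(np.argmax(q[i][obs[i]])) for i in range(n_agents)]

def td_update(obs, acts, team_reward, next_obs, alpha=0.1, gamma=0.99):
    """One-step Q-learning on the single team reward: the shared TD error
    is applied to every agent's component of the decomposed value."""
    target = team_reward + gamma * q_tot(next_obs, greedy_joint_action(next_obs))
    td_error = target - q_tot(obs, acts)
    for i in range(n_agents):
        q[i][obs[i], acts[i]] += alpha * td_error

# Example step: both agents observe locally, act greedily, and one joint
# reward updates all agent-wise components through the shared TD error.
obs, next_obs = [0, 1], [2, 3]
acts = greedy_joint_action(obs)
td_update(obs, acts, team_reward=1.0, next_obs=next_obs)
```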
