QMIX: Monotonic Value Function Factorisation for Deep Multi-Agent Reinforcement Learning

In many real-world settings, a team of agents must coordinate their behaviour while acting in a decentralised way. At the same time, it is often possible to train the agents in a centralised fashion in a simulated or laboratory setting, where global state information is available and communication constraints are lifted. Learning joint action-values conditioned on extra state information is an attractive way to exploit centralised learning, but the best strategy for then extracting decentralised policies is unclear. Our solution is QMIX, a novel value-based method that can train decentralised policies in a centralised end-to-end fashion. QMIX employs a network that estimates joint action-values as a complex non-linear combination of per-agent values that condition only on local observations. We structurally enforce that the joint action-value is monotonic in the per-agent values, which allows tractable maximisation of the joint action-value in off-policy learning, and guarantees consistency between the centralised and decentralised policies. We evaluate QMIX on a challenging set of StarCraft II micromanagement tasks, and show that QMIX significantly outperforms existing value-based multi-agent reinforcement learning methods.
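
To make the monotonicity constraint concrete, here is a minimal PyTorch-style sketch of a mixing network consistent with the architecture the abstract describes: hypernetworks map the global state to the weights of the mixing layers, and an absolute-value transform keeps those weights non-negative, so the estimated joint action-value Q_tot is monotonic in each agent's utility. Names and layer sizes (MixingNetwork, embed_dim, the two-layer hypernetwork for the final bias) are illustrative assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MixingNetwork(nn.Module):
    """Sketch of a monotonic mixing network: combines per-agent Q-values
    into Q_tot. Hypernetworks condition the mixing weights on the global
    state; abs() keeps them non-negative, enforcing dQ_tot/dQ_a >= 0."""

    def __init__(self, n_agents, state_dim, embed_dim=32):
        super().__init__()
        self.n_agents = n_agents
        self.embed_dim = embed_dim
        # Hypernetworks: global state -> weights/biases of the mixing layers.
        self.hyper_w1 = nn.Linear(state_dim, n_agents * embed_dim)
        self.hyper_b1 = nn.Linear(state_dim, embed_dim)
        self.hyper_w2 = nn.Linear(state_dim, embed_dim)
        # Final bias produced by a small state-conditioned MLP.
        self.hyper_b2 = nn.Sequential(
            nn.Linear(state_dim, embed_dim), nn.ReLU(),
            nn.Linear(embed_dim, 1))

    def forward(self, agent_qs, state):
        # agent_qs: (batch, n_agents); state: (batch, state_dim)
        batch = agent_qs.size(0)
        qs = agent_qs.view(batch, 1, self.n_agents)
        # abs() makes the mixing weights non-negative, so Q_tot is
        # monotonically non-decreasing in every per-agent value.
        w1 = torch.abs(self.hyper_w1(state)).view(batch, self.n_agents, self.embed_dim)
        b1 = self.hyper_b1(state).view(batch, 1, self.embed_dim)
        hidden = F.elu(torch.bmm(qs, w1) + b1)   # (batch, 1, embed_dim)
        w2 = torch.abs(self.hyper_w2(state)).view(batch, self.embed_dim, 1)
        b2 = self.hyper_b2(state).view(batch, 1, 1)
        q_tot = torch.bmm(hidden, w2) + b2       # (batch, 1, 1)
        return q_tot.view(batch, 1)

# Example: batch of 4 transitions, 3 agents, 48-dimensional global state.
mixer = MixingNetwork(n_agents=3, state_dim=48)
q_tot = mixer(torch.rand(4, 3), torch.rand(4, 48))  # -> shape (4, 1)
```

Because every mixing weight is non-negative, the argmax of Q_tot over the joint action space decomposes into each agent greedily maximising its own Q_a from local observations; this is what makes the maximisation tractable during off-policy learning and keeps the decentralised policies consistent with the centralised training signal.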
