Inducing Cooperation via Team Regret Minimization based Multi-Agent Deep Reinforcement Learning

Existing value-factorized based Multi-Agent deep Reinforce-ment Learning (MARL) approaches are well-performing invarious multi-agent cooperative environment under thecen-tralized training and decentralized execution(CTDE) scheme,where all agents are trained together by the centralized valuenetwork and each agent execute its policy independently. How-ever, an issue remains open: in the centralized training process,when the environment for the team is partially observable ornon-stationary, i.e., the observation and action informationof all the agents cannot represent the global states, existingmethods perform poorly and sample inefficiently. Regret Min-imization (RM) can be a promising approach as it performswell in partially observable and fully competitive settings.However, it tends to model others as opponents and thus can-not work well under the CTDE scheme. In this work, wepropose a novel team RM based Bayesian MARL with threekey contributions: (a) we design a novel RM method to traincooperative agents as a team and obtain a team regret-basedpolicy for that team; (b) we introduce a novel method to de-compose the team regret to generate the policy for each agentfor decentralized execution; (c) to further improve the perfor-mance, we leverage a differential particle filter (a SequentialMonte Carlo method) network to get an accurate estimation ofthe state for each agent. Experimental results on two-step ma-trix games (cooperative game) and battle games (large-scalemixed cooperative-competitive games) demonstrate that ouralgorithm significantly outperforms state-of-the-art methods.

[1]  Dmitri Botvich,et al.  Multi-agent Learning for Resource Allocationn Dense Heterogeneous 5G Network , 2015, 2015 International Conference on Engineering and Telecommunication (EnT).

[2]  Danna Zhou,et al.  d. , 1934, Microbial pathogenesis.

[3]  Erfu Yang,et al.  Multiagent Reinforcement Learning for Multi-Robot Systems: A Survey , 2004 .

[4]  Olivier Buffet,et al.  Optimally Solving Dec-POMDPs as Continuous-State MDPs , 2013, IJCAI.

[5]  Tuomas Sandholm,et al.  Deep Counterfactual Regret Minimization , 2018, ICML.

[6]  Hoong Chuin Lau,et al.  Credit Assignment For Collective Multiagent RL With Global Rewards , 2018, NeurIPS.

[7]  P ? ? ? ? ? ? ? % ? ? ? ? , 1991 .

[8]  Shimon Whiteson,et al.  Counterfactual Multi-Agent Policy Gradients , 2017, AAAI.

[9]  Yi Wu,et al.  Multi-Agent Actor-Critic for Mixed Cooperative-Competitive Environments , 2017, NIPS.

[10]  Matthew E. Taylor,et al.  A survey and critique of multiagent deep reinforcement learning , 2019, Autonomous Agents and Multi-Agent Systems.

[11]  Matthew E. Taylor,et al.  A survey and critique of multiagent deep reinforcement learning , 2018, Autonomous Agents and Multi-Agent Systems.

[12]  Michael H. Bowling,et al.  Actor-Critic Policy Optimization in Partially Observable Multiagent Environments , 2018, NeurIPS.

[13]  David Hsu,et al.  Particle Filter Networks with Application to Visual Localization , 2018, CoRL.

[14]  Nikos A. Vlassis,et al.  Optimal and Approximate Q-value Functions for Decentralized POMDPs , 2008, J. Artif. Intell. Res..

[15]  Weinan Zhang,et al.  MAgent: A Many-Agent Reinforcement Learning Platform for Artificial Collective Intelligence , 2017, AAAI.

[16]  Shimon Whiteson,et al.  QMIX: Monotonic Value Function Factorisation for Deep Multi-Agent Reinforcement Learning , 2018, ICML.

[17]  Taeyoung Lee,et al.  Learning to Schedule Communication in Multi-agent Reinforcement Learning , 2019, ICLR.

[18]  Yung Yi,et al.  QTRAN: Learning to Factorize with Transformation for Cooperative Multi-Agent Reinforcement Learning , 2019, ICML.

[19]  Michael H. Bowling,et al.  Regret Minimization in Games with Incomplete Information , 2007, NIPS.

[20]  Tsuyoshi Murata,et al.  {m , 1934, ACML.

[21]  Shane Legg,et al.  Human-level control through deep reinforcement learning , 2015, Nature.

[22]  Kurt Keutzer,et al.  Regret Minimization for Partially Observable Deep Reinforcement Learning , 2017, ICML.

[23]  Sam Devlin,et al.  Potential-based difference rewards for multiagent reinforcement learning , 2014, AAMAS.

[24]  G. G. Stokes "J." , 1890, The New Yale Book of Quotations.

[25]  Fei Sha,et al.  Actor-Attention-Critic for Multi-Agent Reinforcement Learning , 2018, ICML.

[26]  Guy Lever,et al.  Value-Decomposition Networks For Cooperative Multi-Agent Learning Based On Team Reward , 2018, AAMAS.

[27]  Shimon Whiteson,et al.  Deep Variational Reinforcement Learning for POMDPs , 2018, ICML.

[28]  Nando de Freitas,et al.  An Introduction to Sequential Monte Carlo Methods , 2001, Sequential Monte Carlo Methods in Practice.

[29]  Frans A. Oliehoek,et al.  Bayesian Reinforcement Learning in Factored POMDPs , 2018, AAMAS.

[30]  Peng Peng,et al.  Multiagent Bidirectionally-Coordinated Nets: Emergence of Human-level Coordination in Learning to Play StarCraft Combat Games , 2017, 1703.10069.

[31]  Feng Wu,et al.  Multi-Agent Planning with Baseline Regret Minimization , 2017, IJCAI.