Paper Information - Inducing Cooperation via Team Regret Minimization based Multi-Agent Deep Reinforcement Learning - 字舞流文

Inducing Cooperation via Team Regret Minimization based Multi-Agent Deep Reinforcement Learning

Existing value-factorization based Multi-Agent deep Reinforcement Learning (MARL) approaches perform well in various multi-agent cooperative environments under the centralized training and decentralized execution (CTDE) scheme, where all agents are trained together by a centralized value network and each agent executes its policy independently. However, an issue remains open: in the centralized training process, when the environment for the team is partially observable or non-stationary, i.e., the observation and action information of all the agents cannot represent the global state, existing methods perform poorly and are sample-inefficient. Regret Minimization (RM) can be a promising approach, as it performs well in partially observable and fully competitive settings. However, it tends to model others as opponents and thus cannot work well under the CTDE scheme. In this work, we propose a novel team RM based Bayesian MARL with three key contributions: (a) we design a novel RM method to train cooperative agents as a team and obtain a team regret-based policy for that team; (b) we introduce a novel method to decompose the team regret to generate the policy for each agent for decentralized execution; (c) to further improve the performance, we leverage a differential particle filter (a Sequential Monte Carlo method) network to get an accurate estimation of the state for each agent. Experimental results on two-step matrix games (cooperative game) and battle games (large-scale mixed cooperative-competitive games) demonstrate that our algorithm significantly outperforms state-of-the-art methods.
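As a rough illustration of the regret minimization idea mentioned in the abstract, the sketch below runs plain regret matching with the whole team treated as a single decision maker over joint actions. It is not the paper's team RM algorithm or its regret decomposition; the 2x2 payoff values, the function names, and the flattened joint-action layout are all assumptions chosen for illustration.

```python
def regret_matching_policy(cum_regret):
    """Play each action in proportion to its positive cumulative regret."""
    positive = [max(r, 0.0) for r in cum_regret]
    total = sum(positive)
    if total <= 0.0:
        # No action has positive regret yet: fall back to uniform play.
        return [1.0 / len(cum_regret)] * len(cum_regret)
    return [p / total for p in positive]


def team_regret_matching(joint_payoffs, iterations=100):
    """Regret matching over joint actions with a shared team payoff.

    joint_payoffs[a] is the (assumed, fixed) team reward of joint action a.
    Returns the policy averaged over all iterations.
    """
    n = len(joint_payoffs)
    cum_regret = [0.0] * n
    policy_sum = [0.0] * n
    for _ in range(iterations):
        policy = regret_matching_policy(cum_regret)
        # Expected team payoff under the current joint policy.
        expected = sum(p * u for p, u in zip(policy, joint_payoffs))
        for a in range(n):
            # Regret of not having played joint action a.
            cum_regret[a] += joint_payoffs[a] - expected
            policy_sum[a] += policy[a]
    return [s / iterations for s in policy_sum]


# Hypothetical 2x2 cooperative matrix game, flattened over joint actions.
joint_actions = [("A", "A"), ("A", "B"), ("B", "A"), ("B", "B")]
payoffs = [8.0, 0.0, 0.0, 6.0]

avg_policy = team_regret_matching(payoffs, iterations=100)
best = joint_actions[max(range(len(payoffs)), key=lambda a: avg_policy[a])]
print(best, avg_policy)
```

In this toy setting the average policy concentrates on the joint action with the highest team payoff. Turning such a joint policy into per-agent policies for decentralized execution is the part the abstract attributes to the paper's regret decomposition, which this sketch does not attempt.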

Hanjiang Lai | Xinrun Wang | Bo An | Zhenyu Shi | Rundong Wang | Xinwen Hou | Runsheng Yu | Buhong Liu

[1] Dmitri Botvich, et al. Multi-agent Learning for Resource Allocation in Dense Heterogeneous 5G Network, 2015, 2015 International Conference on Engineering and Telecommunication (EnT).

[3] Erfu Yang, et al. Multiagent Reinforcement Learning for Multi-Robot Systems: A Survey, 2004.

[4] Olivier Buffet, et al. Optimally Solving Dec-POMDPs as Continuous-State MDPs, 2013, IJCAI.

[5] Tuomas Sandholm, et al. Deep Counterfactual Regret Minimization, 2018, ICML.

[6] Hoong Chuin Lau, et al. Credit Assignment For Collective Multiagent RL With Global Rewards, 2018, NeurIPS.

[8] Shimon Whiteson, et al. Counterfactual Multi-Agent Policy Gradients, 2017, AAAI.

[9] Yi Wu, et al. Multi-Agent Actor-Critic for Mixed Cooperative-Competitive Environments, 2017, NIPS.

[10] Matthew E. Taylor, et al. A survey and critique of multiagent deep reinforcement learning, 2019, Autonomous Agents and Multi-Agent Systems.

[11] Matthew E. Taylor, et al. A survey and critique of multiagent deep reinforcement learning, 2018, Autonomous Agents and Multi-Agent Systems.

[12] Michael H. Bowling, et al. Actor-Critic Policy Optimization in Partially Observable Multiagent Environments, 2018, NeurIPS.

[13] David Hsu, et al. Particle Filter Networks with Application to Visual Localization, 2018, CoRL.

[14] Nikos A. Vlassis, et al. Optimal and Approximate Q-value Functions for Decentralized POMDPs, 2008, J. Artif. Intell. Res.

[15] Weinan Zhang, et al. MAgent: A Many-Agent Reinforcement Learning Platform for Artificial Collective Intelligence, 2017, AAAI.

[16] Shimon Whiteson, et al. QMIX: Monotonic Value Function Factorisation for Deep Multi-Agent Reinforcement Learning, 2018, ICML.

[17] Taeyoung Lee, et al. Learning to Schedule Communication in Multi-agent Reinforcement Learning, 2019, ICLR.

[18] Yung Yi, et al. QTRAN: Learning to Factorize with Transformation for Cooperative Multi-Agent Reinforcement Learning, 2019, ICML.

[19] Michael H. Bowling, et al. Regret Minimization in Games with Incomplete Information, 2007, NIPS.

[21] Shane Legg, et al. Human-level control through deep reinforcement learning, 2015, Nature.

[22] Kurt Keutzer, et al. Regret Minimization for Partially Observable Deep Reinforcement Learning, 2017, ICML.

[23] Sam Devlin, et al. Potential-based difference rewards for multiagent reinforcement learning, 2014, AAMAS.

[25] Fei Sha, et al. Actor-Attention-Critic for Multi-Agent Reinforcement Learning, 2018, ICML.

[26] Guy Lever, et al. Value-Decomposition Networks For Cooperative Multi-Agent Learning Based On Team Reward, 2018, AAMAS.

[27] Shimon Whiteson, et al. Deep Variational Reinforcement Learning for POMDPs, 2018, ICML.

[28] Nando de Freitas, et al. An Introduction to Sequential Monte Carlo Methods, 2001, Sequential Monte Carlo Methods in Practice.

[29] Frans A. Oliehoek, et al. Bayesian Reinforcement Learning in Factored POMDPs, 2018, AAMAS.

[30] Peng Peng, et al. Multiagent Bidirectionally-Coordinated Nets: Emergence of Human-level Coordination in Learning to Play StarCraft Combat Games, 2017, arXiv:1703.10069.

[31] Feng Wu, et al. Multi-Agent Planning with Baseline Regret Minimization, 2017, IJCAI.