QOPT: Optimistic Value Function Decentralization for Cooperative Multi-Agent Reinforcement Learning

We propose a novel value-based algorithm for cooperative multi-agent reinforcement learning under the paradigm of centralized training with decentralized execution. The proposed algorithm, coined QOPT, is based on an "optimistic" training scheme that uses two action-value estimators with separate roles: (i) estimating the true joint action value and (ii) decentralizing the optimal joint action. By construction, our framework allows the latter estimator to achieve (ii) while representing a richer class of joint action-value functions than that of the state-of-the-art algorithm QMIX. Our experiments demonstrate that QOPT achieves new state-of-the-art performance in the StarCraft Multi-Agent Challenge environment. In particular, it significantly outperforms the baselines in settings where non-cooperative behaviors are penalized more aggressively.
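To make the two-estimator scheme concrete, the following is a minimal PyTorch sketch of the general idea, not the paper's implementation: an unconstrained network estimates the true joint action value, while per-agent utilities (summed VDN-style here for brevity; QOPT's decentralization estimator is strictly richer than QMIX's monotonic mixing) yield decentralizable greedy actions. All names (`AgentUtility`, `JointQ`, `two_estimator_loss`) are illustrative assumptions, and the one-sided penalty below is only a plausible stand-in for the paper's "optimistic" objective.

```python
# Hypothetical sketch of training with two action-value estimators.
# Assumes PyTorch; shapes: state (B, STATE_DIM), obs (B, N_AGENTS, OBS_DIM),
# joint_action (B, N_AGENTS) of action indices, td_target (B,).
import torch
import torch.nn as nn

N_AGENTS, N_ACTIONS, STATE_DIM, OBS_DIM = 3, 5, 16, 10

class AgentUtility(nn.Module):
    """Per-agent utility Q_i(o_i, a_i); its argmax is the decentralized greedy action."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(OBS_DIM, 64), nn.ReLU(),
                                 nn.Linear(64, N_ACTIONS))

    def forward(self, obs):                 # obs: (B, OBS_DIM)
        return self.net(obs)                # (B, N_ACTIONS)

class JointQ(nn.Module):
    """Unconstrained estimator of the true joint action value Q_jt(s, a)."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(STATE_DIM + N_AGENTS, 64), nn.ReLU(),
                                 nn.Linear(64, 1))

    def forward(self, state, joint_action):
        x = torch.cat([state, joint_action.float()], dim=-1)
        return self.net(x).squeeze(-1)      # (B,)

agents = nn.ModuleList([AgentUtility() for _ in range(N_AGENTS)])
q_joint = JointQ()

def two_estimator_loss(state, obs, joint_action, td_target):
    # Role (i): fit the unconstrained estimator to the TD target.
    q_jt = q_joint(state, joint_action)
    td_loss = ((q_jt - td_target) ** 2).mean()
    # Role (ii): evaluate the decentralizable estimator at the taken actions
    # (a plain sum of utilities here; a real implementation would use a richer mixer).
    q_i = [agents[i](obs[:, i]).gather(1, joint_action[:, i:i + 1]).squeeze(1)
           for i in range(N_AGENTS)]
    q_fac = torch.stack(q_i, dim=0).sum(dim=0)
    # One-sided ("optimistic") coupling: penalize the factorized value only when
    # it overshoots the detached true estimate, leaving it free elsewhere.
    opt_loss = (torch.relu(q_fac - q_jt.detach()) ** 2).mean()
    return td_loss + opt_loss

# Example forward/backward pass on random data:
B = 4
loss = two_estimator_loss(torch.randn(B, STATE_DIM),
                          torch.randn(B, N_AGENTS, OBS_DIM),
                          torch.randint(0, N_ACTIONS, (B, N_AGENTS)),
                          torch.randn(B))
loss.backward()
```

During execution, each agent would act greedily on its own `AgentUtility` output, which is what makes the second estimator decentralizable.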

[1] Yung Yi et al. QTRAN: Learning to Factorize with Transformation for Cooperative Multi-Agent Reinforcement Learning. ICML, 2019.

[2] Yuval Tassa et al. Continuous control with deep reinforcement learning. ICLR, 2015.

[3] S. G. Ponnambalam et al. Reinforcement learning in swarm-robotics for multi-agent foraging-task domain. IEEE Symposium on Swarm Intelligence (SIS), 2013.

[4] Yun Yang et al. A Multi-Agent Framework for Packet Routing in Wireless Sensor Networks. Sensors, 2015.

[5] Demis Hassabis et al. Mastering the game of Go without human knowledge. Nature, 2017.

[6] Guy Lever et al. Value-Decomposition Networks for Cooperative Multi-Agent Learning Based on Team Reward. AAMAS, 2018.

[7] Frans A. Oliehoek et al. A Concise Introduction to Decentralized POMDPs. SpringerBriefs in Intelligent Systems, 2016.

[8] Jianye Hao et al. Qatten: A General Framework for Cooperative Multiagent Reinforcement Learning. arXiv, 2020.

[9] Peter Stone et al. Deep Recurrent Q-Learning for Partially Observable MDPs. AAAI Fall Symposia, 2015.

[10] Dorian Kodelja et al. Multiagent cooperation and competition with deep reinforcement learning. PLoS ONE, 2015.

[11] Martin Lauer et al. An Algorithm for Distributed Reinforcement Learning in Cooperative Multi-Agent Systems. ICML, 2000.

[12] Chongjie Zhang et al. ROMA: Multi-Agent Reinforcement Learning with Emergent Roles. ICML, 2020.

[13] Demis Hassabis et al. Mastering the game of Go with deep neural networks and tree search. Nature, 2016.

[14] Shimon Whiteson et al. MAVEN: Multi-Agent Variational Exploration. NeurIPS, 2019.

[15] Amnon Shashua et al. Safe, Multi-Agent, Reinforcement Learning for Autonomous Driving. arXiv, 2016.

[16] Shimon Whiteson et al. Counterfactual Multi-Agent Policy Gradients. AAAI, 2017.

[17] Yi Wu et al. Multi-Agent Actor-Critic for Mixed Cooperative-Competitive Environments. NIPS, 2017.

[18] Lei Han et al. LIIR: Learning Individual Intrinsic Reward in Multi-Agent Reinforcement Learning. NeurIPS, 2019.

[19] Shimon Whiteson et al. The StarCraft Multi-Agent Challenge. AAMAS, 2019.

[20] Shimon Whiteson et al. QMIX: Monotonic Value Function Factorisation for Deep Multi-Agent Reinforcement Learning. ICML, 2018.

[21] Shane Legg et al. Human-level control through deep reinforcement learning. Nature, 2015.

[22] Fei Sha et al. Actor-Attention-Critic for Multi-Agent Reinforcement Learning. ICML, 2018.