Advantage Constrained Proximal Policy Optimization in Multi-Agent Reinforcement Learning

We investigate the integration of value-based and policy gradient methods in multi-agent reinforcement learning (MARL). The Individual-Global-Max (IGM) principle plays an important role in value-based MARL, as it ensures consistency between joint and local action values; however, IGM is difficult to guarantee in multi-agent policy gradient methods due to stochastic exploration and conflicting gradient directions. In this paper, we propose a novel multi-agent policy gradient algorithm called Advantage Constrained Proximal Policy Optimization (ACPPO). ACPPO computes each agent's local state-action advantage with its advantage network and estimates the joint state-action advantage using the multi-agent advantage decomposition lemma. Each agent's coefficient then constrains the joint advantage according to the consistency between the estimated joint advantage and the local advantage. Unlike previous multi-agent policy gradient algorithms, ACPPO requires neither an additional sampled baseline to reduce variance nor a sequential update scheme to improve accuracy. We evaluate the proposed method on a continuous matrix game, the StarCraft Multi-Agent Challenge, and Multi-Agent MuJoCo tasks. The results show that ACPPO outperforms baselines such as MAPPO, MADDPG, and HATRPO.
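The abstract's pipeline (local advantages, a decomposition-based joint advantage estimate, and a per-agent consistency coefficient inside a PPO-style clipped objective) can be illustrated with a minimal sketch. This is not the paper's implementation: the sum-of-local-advantages estimate, the sign-agreement rule for the coefficient, and the 0.5 down-weighting are illustrative assumptions chosen only to make the structure concrete.

```python
import numpy as np

def acppo_surrogate(ratios, local_advs, clip_eps=0.2):
    """Sketch of an ACPPO-style clipped surrogate loss.

    ratios:     (n_agents, batch) importance ratios pi_new / pi_old.
    local_advs: (n_agents, batch) per-agent local advantage estimates.

    Assumed (not taken from the paper text): the joint advantage is
    approximated as the sum of local advantages, and each agent's
    coefficient is 1 when its local advantage agrees in sign with the
    joint estimate, and 0.5 otherwise.
    """
    # Decomposition-style estimate of the joint advantage.
    joint_adv = np.sum(local_advs, axis=0)

    losses = []
    for r, a in zip(ratios, local_advs):
        # Consistency coefficient: down-weight samples where the local
        # advantage conflicts with the joint advantage estimate.
        coef = np.where(np.sign(a) == np.sign(joint_adv), 1.0, 0.5)
        adv = coef * joint_adv

        # Standard PPO clipping applied per agent.
        unclipped = r * adv
        clipped = np.clip(r, 1.0 - clip_eps, 1.0 + clip_eps) * adv
        losses.append(-np.mean(np.minimum(unclipped, clipped)))

    # Average the surrogate loss over agents (to be minimized).
    return float(np.mean(losses))
```

In this sketch the constraint enters only through the coefficient, so no extra sampled baseline or sequential per-agent update is needed, which matches the property the abstract claims for ACPPO.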
