Advantage Constrained Proximal Policy Optimization in Multi-Agent Reinforcement Learning

We investigate the integration of value-based and policy gradient methods in multi-agent reinforcement learning (MARL). The Individual-Global-Max (IGM) principle plays an important role in value-based MARL, as it ensures consistency between joint and local action values; however, IGM is difficult to guarantee in multi-agent policy gradient methods due to stochastic exploration and conflicting gradient directions. In this paper, we propose a novel multi-agent policy gradient algorithm called Advantage Constrained Proximal Policy Optimization (ACPPO). ACPPO computes each agent's local state-action advantage with its advantage network and estimates the joint state-action advantage using the multi-agent advantage decomposition lemma. Each agent's coefficient then constrains the joint advantage according to the consistency between the estimated joint advantage and the local advantage. Unlike previous multi-agent policy gradient algorithms, ACPPO requires neither an additional sampled baseline to reduce variance nor a sequential update scheme to improve accuracy. We evaluate the proposed method on a continuous matrix game, the StarCraft Multi-Agent Challenge, and Multi-Agent MuJoCo tasks. The results show that ACPPO outperforms baselines such as MAPPO, MADDPG, and HATRPO.
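The abstract's pipeline (local advantages, a decomposition-based joint advantage estimate, and a per-agent consistency coefficient inside a PPO-style clipped objective) can be illustrated with a minimal sketch. This is not the paper's implementation: the sum-of-local-advantages estimate, the sign-agreement rule for the coefficient, and the 0.5 down-weighting are illustrative assumptions chosen only to make the structure concrete.

```python
import numpy as np

def acppo_surrogate(ratios, local_advs, clip_eps=0.2):
    """Sketch of an ACPPO-style clipped surrogate loss.

    ratios:     (n_agents, batch) importance ratios pi_new / pi_old.
    local_advs: (n_agents, batch) per-agent local advantage estimates.

    Assumed (not taken from the paper text): the joint advantage is
    approximated as the sum of local advantages, and each agent's
    coefficient is 1 when its local advantage agrees in sign with the
    joint estimate, and 0.5 otherwise.
    """
    # Decomposition-style estimate of the joint advantage.
    joint_adv = np.sum(local_advs, axis=0)

    losses = []
    for r, a in zip(ratios, local_advs):
        # Consistency coefficient: down-weight samples where the local
        # advantage conflicts with the joint advantage estimate.
        coef = np.where(np.sign(a) == np.sign(joint_adv), 1.0, 0.5)
        adv = coef * joint_adv

        # Standard PPO clipping applied per agent.
        unclipped = r * adv
        clipped = np.clip(r, 1.0 - clip_eps, 1.0 + clip_eps) * adv
        losses.append(-np.mean(np.minimum(unclipped, clipped)))

    # Average the surrogate loss over agents (to be minimized).
    return float(np.mean(losses))
```

In this sketch the constraint enters only through the coefficient, so no extra sampled baseline or sequential per-agent update is needed, which matches the property the abstract claims for ACPPO.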
