Trust Region Bounds for Decentralized PPO Under Non-stationarity

We present trust region bounds for optimizing decentralized policies in cooperative Multi-Agent Reinforcement Learning (MARL), which hold even when the transition dynamics are non-stationary. This new analysis provides a theoretical understanding of the strong performance of two recent actor-critic methods for MARL (IPPO and MAPPO), which both rely on independent ratios, i.e., probability ratios computed separately for each agent's policy. We show that, despite the non-stationarity that independent ratios cause, a monotonic improvement guarantee still arises from enforcing the trust region constraint over all decentralized policies. We also show that this trust region constraint can be enforced effectively and in a principled way by bounding the independent ratios according to the number of agents in training, providing a theoretical foundation for proximal ratio clipping. Finally, our empirical results support the hypothesis that the strong performance of IPPO and MAPPO is a direct result of enforcing such a trust region constraint via clipping in centralized training, and of tuning the hyperparameters with regard to the number of agents, as predicted by our theoretical analysis.
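As a concrete illustration of the mechanism described above, the sketch below shows a clipped surrogate loss computed with independent per-agent ratios, where the clip range is tightened as the number of agents grows. This is a minimal sketch in PyTorch, not the paper's reference implementation: the helper name, the `base_clip` default, and the `base_clip / num_agents` scaling are illustrative assumptions rather than the exact bound derived in the analysis.

```python
# Minimal sketch (not the paper's reference implementation) of a clipped
# surrogate loss with independent per-agent ratios. The helper name, the
# `base_clip` default, and the base_clip / num_agents scaling are
# illustrative assumptions, not the paper's exact clipping bound.
import torch


def per_agent_clipped_loss(new_logp: torch.Tensor,
                           old_logp: torch.Tensor,
                           advantages: torch.Tensor,
                           num_agents: int,
                           base_clip: float = 0.2) -> torch.Tensor:
    """All tensors have shape [batch, num_agents]; returns a scalar loss."""
    # Independent ratio for each agent's policy, as described in the abstract.
    ratios = torch.exp(new_logp - old_logp)

    # Hypothetical heuristic: shrink the clip range as the number of agents
    # grows, since the joint trust region depends on all decentralized policies.
    eps = base_clip / num_agents

    unclipped = ratios * advantages
    clipped = torch.clamp(ratios, 1.0 - eps, 1.0 + eps) * advantages

    # Pessimistic (elementwise min) surrogate, averaged over batch and agents;
    # negated so that minimizing the loss maximizes the surrogate objective.
    return -torch.min(unclipped, clipped).mean()
```

In IPPO/MAPPO-style centralized training, each agent's log-probabilities would come from its own decentralized policy, while the advantages may be estimated with either a decentralized or a centralized critic; the sketch is agnostic to that choice.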
