Trust Region Bounds for Decentralized PPO Under Non-stationarity

We present trust region bounds for optimizing decentralized policies in cooperative Multi-Agent Reinforcement Learning (MARL), which hold even when the transition dynamics are non-stationary. This new analysis provides a theoretical understanding of the strong performance of two recent actor-critic methods for MARL (IPPO and MAPPO), which both rely on independent ratios, i.e., probability ratios computed separately for each agent's policy. We show that, despite the non-stationarity that independent ratios cause, a monotonic improvement guarantee still arises from enforcing the trust region constraint over all decentralized policies. We also show that this trust region constraint can be enforced effectively and in a principled way by bounding the independent ratios according to the number of agents in training, providing a theoretical foundation for proximal ratio clipping. Finally, our empirical results support the hypothesis that the strong performance of IPPO and MAPPO is a direct result of enforcing such a trust region constraint via clipping in centralized training, and of tuning the hyperparameters with regard to the number of agents, as predicted by our theoretical analysis.
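As a concrete illustration of the mechanism described above, the sketch below shows a clipped surrogate loss computed with independent per-agent ratios, where the clip range is tightened as the number of agents grows. This is a minimal sketch in PyTorch, not the paper's reference implementation: the helper name, the `base_clip` default, and the `base_clip / num_agents` scaling are illustrative assumptions rather than the exact bound derived in the analysis.

```python
# Minimal sketch (not the paper's reference implementation) of a clipped
# surrogate loss with independent per-agent ratios. The helper name, the
# `base_clip` default, and the base_clip / num_agents scaling are
# illustrative assumptions, not the paper's exact clipping bound.
import torch


def per_agent_clipped_loss(new_logp: torch.Tensor,
                           old_logp: torch.Tensor,
                           advantages: torch.Tensor,
                           num_agents: int,
                           base_clip: float = 0.2) -> torch.Tensor:
    """All tensors have shape [batch, num_agents]; returns a scalar loss."""
    # Independent ratio for each agent's policy, as described in the abstract.
    ratios = torch.exp(new_logp - old_logp)

    # Hypothetical heuristic: shrink the clip range as the number of agents
    # grows, since the joint trust region depends on all decentralized policies.
    eps = base_clip / num_agents

    unclipped = ratios * advantages
    clipped = torch.clamp(ratios, 1.0 - eps, 1.0 + eps) * advantages

    # Pessimistic (elementwise min) surrogate, averaged over batch and agents;
    # negated so that minimizing the loss maximizes the surrogate objective.
    return -torch.min(unclipped, clipped).mean()
```

In IPPO/MAPPO-style centralized training, each agent's log-probabilities would come from its own decentralized policy, while the advantages may be estimated with either a decentralized or a centralized critic; the sketch is agnostic to that choice.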
