Promoting Coordination through Policy Regularization in Multi-Agent Reinforcement Learning

In multi-agent reinforcement learning, discovering successful collective behaviors is challenging because it requires exploring a joint action space that grows exponentially with the number of agents. While the tractability of independent agent-wise exploration is appealing, this approach fails on tasks that require elaborate group strategies. We argue that coordinating the agents' policies can guide their exploration, and we investigate techniques to promote such an inductive bias. We propose two policy regularization methods: TeamReg, which is based on inter-agent action predictability, and CoachReg, which relies on synchronized behavior selection. We evaluate each approach on four challenging continuous control tasks with sparse rewards that require varying levels of coordination. Our methodology allocates the same hyper-parameter search budget to our algorithms and baselines, and we find that our approaches are more robust to hyper-parameter variations. Our experiments show that our methods significantly improve performance on cooperative multi-agent problems and scale well when the number of agents is increased. Finally, we quantitatively analyze the effects of our proposed methods on the policies that our agents learn, and we show that our methods successfully enforce the qualities that we propose as proxies for coordinated behaviors.
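To make the first idea concrete, the sketch below illustrates one way an inter-agent predictability regularizer could look: each agent keeps a model of a teammate's action and incurs a penalty when the teammate's actual action diverges from that prediction. This is a minimal PyTorch sketch under assumed interfaces; the names `Policy`, `TeammateModel`, `team_spirit_loss`, and `lambda_team` are illustrative and are not taken from the paper's implementation.

```python
# Minimal sketch of a predictability-based regularizer (TeamReg-style),
# assuming deterministic continuous-control policies. All class/variable
# names here are hypothetical illustrations, not the authors' code.
import torch
import torch.nn as nn

class Policy(nn.Module):
    """Deterministic policy: maps an agent's observation to a continuous action."""
    def __init__(self, obs_dim, act_dim, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, act_dim), nn.Tanh(),
        )
    def forward(self, obs):
        return self.net(obs)

class TeammateModel(nn.Module):
    """Predicts a teammate's action from this agent's own observation."""
    def __init__(self, obs_dim, act_dim, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, act_dim), nn.Tanh(),
        )
    def forward(self, obs):
        return self.net(obs)

def team_spirit_loss(obs_i, obs_j, policy_j, teammate_model_i):
    """Penalize the gap between agent i's prediction of agent j's action
    and the action agent j actually takes."""
    predicted_a_j = teammate_model_i(obs_i)
    actual_a_j = policy_j(obs_j)
    return ((predicted_a_j - actual_a_j) ** 2).mean()

# Toy usage: two agents, a batch of 8 observations each.
obs_dim, act_dim = 10, 2
policy_j = Policy(obs_dim, act_dim)
teammate_model_i = TeammateModel(obs_dim, act_dim)
obs_i, obs_j = torch.randn(8, obs_dim), torch.randn(8, obs_dim)

lambda_team = 0.1  # illustrative regularization weight
reg = lambda_team * team_spirit_loss(obs_i, obs_j, policy_j, teammate_model_i)
# In training, `reg` would be added to the usual actor loss before backprop.
```

Depending on which parameters gradients are allowed to flow into, the same term can serve two roles: training the teammate model to anticipate the other agent's behavior, and nudging the other agent's policy toward actions its teammates can predict.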
