Multi-Agent Trust Region Policy Optimization

We extend trust region policy optimization (TRPO) to multi-agent reinforcement learning (MARL) problems. We show that the TRPO policy update can be transformed into a distributed consensus optimization problem in the multi-agent case. By making a series of approximations to the consensus optimization model, we propose a decentralized MARL algorithm, which we call multi-agent TRPO (MATRPO). The algorithm optimizes distributed policies based on local observations and private rewards. Agents do not need to know the observations, rewards, policies, or value/action-value functions of other agents; during training they share only a likelihood ratio with their neighbors. The algorithm is therefore fully decentralized and privacy-preserving. Experiments on two cooperative games demonstrate its robust performance on complicated MARL tasks.
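To make the shared quantity concrete, the following is a minimal sketch of a per-agent likelihood ratio and the TRPO-style surrogate it feeds into. It is not the MATRPO algorithm itself: the consensus step and the trust region constraint are omitted, and the assumption that the joint policy factorizes into a product of per-agent policies is ours, as are all function names.

```python
import math

def local_ratio(new_logp, old_logp):
    """Per-sample likelihood ratio pi_new(a|o) / pi_old(a|o) for ONE agent.

    Inputs are log-probabilities of the actions the agent actually took,
    evaluated under its new and old local policies (hypothetical inputs).
    """
    return [math.exp(n - o) for n, o in zip(new_logp, old_logp)]

def joint_ratio(ratios_per_agent):
    """Per-sample product of the agents' ratios.

    ASSUMPTION: the joint policy is a product of independent local policies,
    so the joint likelihood ratio is the product of the local ratios.
    """
    return [math.prod(sample) for sample in zip(*ratios_per_agent)]

def surrogate(ratio, advantages):
    """Sample mean of ratio * advantage, the standard TRPO surrogate objective."""
    return sum(r * a for r, a in zip(ratio, advantages)) / len(ratio)

# Two agents, two sampled time steps (made-up numbers for illustration).
agent1 = local_ratio([math.log(0.6), math.log(0.2)],
                     [math.log(0.5), math.log(0.25)])
agent2 = local_ratio([math.log(0.5), math.log(0.6)],
                     [math.log(0.5), math.log(0.5)])

# Only the ratios cross agent boundaries; observations, rewards, and
# policy parameters stay local.
shared = joint_ratio([agent1, agent2])
print(surrogate(shared, advantages=[1.0, -0.5]))
```

In this sketch each agent evaluates its own policy on its own data and passes on only a list of scalars, which is the sense in which exchanging a likelihood ratio reveals less than exchanging observations, rewards, or policy parameters.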
