Improving sample efficiency in Multi-Agent Actor-Critic methods

The popularity of multi-agent deep reinforcement learning (MADRL) is growing rapidly with the demand for large-scale real-world tasks that require swarm intelligence, and many studies have improved MADRL from the perspective of network structures or reinforcement learning methods. However, real-world applications of MADRL are hampered by the low sample efficiency of current models and the high cost of collecting data. To improve practicality, an extension to the current MADRL training paradigm that improves sample efficiency is needed. To this end, this paper proposes PEDMA, a flexible plug-in unit for MADRL that combines three techniques: (i) Parallel Environments (PE), which accelerate data acquisition; (ii) Experience Augmentation (EA), a novel data augmentation method that exploits the permutation invariance of multi-agent systems to reduce the cost of acquiring data; and (iii) Delayed Updated Policies (DUP), which improve how efficiently the model uses the collected data. The EA method is shown, both theoretically and empirically, to improve the performance, data efficiency, and convergence speed of MADRL models. Experiments on three multi-agent benchmark tasks show that the MAAC model trained with PEDMA outperforms the baselines and state-of-the-art algorithms, and ablation studies demonstrate the contribution and necessity of each component of PEDMA.
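The paper's implementation of EA is not reproduced here, but the core idea of exploiting permutation invariance can be illustrated with a minimal sketch: assuming homogeneous, interchangeable agents, a stored transition remains valid when the agent dimension of its per-agent observations, actions, and rewards is reordered, so each collected transition can be permuted into additional training samples. The function name `augment_transition` and the array shapes below are illustrative assumptions, not the authors' API.

```python
import numpy as np


def augment_transition(obs, actions, rewards, next_obs, num_copies=1):
    """Create extra training samples by permuting the agent dimension.

    Assumes homogeneous, interchangeable agents, so applying the same
    permutation to every per-agent array yields an equally valid transition.
    All inputs are arrays with the agent index as the first axis.
    """
    n_agents = obs.shape[0]
    augmented = []
    for _ in range(num_copies):
        perm = np.random.permutation(n_agents)
        augmented.append((obs[perm], actions[perm], rewards[perm], next_obs[perm]))
    return augmented


# Usage: one stored transition for 3 agents yields a permuted copy that can be
# pushed into the replay buffer alongside the original sample.
obs = np.random.randn(3, 8)       # per-agent observations
actions = np.random.randn(3, 2)   # per-agent actions
rewards = np.random.randn(3)      # per-agent rewards
next_obs = np.random.randn(3, 8)
extra = augment_transition(obs, actions, rewards, next_obs, num_copies=1)
```

In a centralized-critic setting such as MAAC, the same permutation would also have to be applied consistently to any critic-side inputs that are ordered by agent, so that the augmented sample remains a valid joint transition.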
