Improved Cooperative Multi-agent Reinforcement Learning Algorithm Augmented by Mixing Demonstrations from Centralized Policy

Many decision problems in complex systems with multiple decision makers can be formulated as a decentralized partially observable Markov decision process (dec-POMDP). Because obtaining optimal policies is computationally intractable, recent approaches to dec-POMDPs often rely on multi-agent reinforcement learning (MARL) algorithms. We propose a method that improves existing cooperative MARL algorithms by incorporating an imitation learning technique. As the reference policy for imitation learning, we use a centralized policy obtained from a multi-agent MDP (MMDP) or multi-agent POMDP (MPOMDP) model reduced from the original dec-POMDP. During training, the proposed method mixes in demonstrations from the reference policy via a demonstration buffer; samples drawn from this buffer are used in an augmented policy gradient for policy updates. We assess the performance of the proposed method on three well-known dec-POMDP benchmark problems: Mars rover, cooperative box pushing, and dec-tiger. Experimental results indicate that augmenting the baseline MARL algorithm with mixed demonstrations significantly improves the quality of the resulting policies. From these results, we conclude that imitation learning can enhance MARL algorithms and that policy solutions of MMDP and MPOMDP models are a reasonable choice of reference policy for the proposed algorithm.
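To make the described mechanism concrete, the snippet below is a minimal illustrative sketch, not the authors' implementation: it assumes a PyTorch setting, and the names (DemonstrationBuffer, augmented_loss), the cross-entropy imitation term on demonstration samples, and the mixing weight lambda_demo are assumptions about one common way to augment a policy-gradient loss with demonstrations from a centralized reference policy.

```python
import torch
import torch.nn.functional as F

class DemonstrationBuffer:
    """FIFO buffer of (observation, action) pairs collected from the centralized reference policy."""
    def __init__(self, capacity=10_000):
        self.capacity = capacity
        self.data = []

    def add(self, obs, action):
        # obs: tensor observation; action: integer action chosen by the reference (MMDP/MPOMDP) policy.
        if len(self.data) >= self.capacity:
            self.data.pop(0)
        self.data.append((obs, action))

    def sample(self, batch_size):
        idx = torch.randint(len(self.data), (batch_size,))
        obs, act = zip(*(self.data[i] for i in idx))
        return torch.stack(obs), torch.tensor(act)

def augmented_loss(policy, rl_obs, rl_actions, advantages, demo_buffer,
                   demo_batch_size=32, lambda_demo=0.1):
    """Policy-gradient loss plus an imitation (behavioral-cloning) term on demonstration samples."""
    # Standard policy-gradient term on on-policy data (REINFORCE / actor-critic style).
    log_probs = F.log_softmax(policy(rl_obs), dim=-1)
    chosen = log_probs.gather(1, rl_actions.unsqueeze(1)).squeeze(1)
    pg_loss = -(chosen * advantages).mean()

    # Imitation term: negative log-likelihood of the reference policy's actions on buffer samples.
    demo_obs, demo_actions = demo_buffer.sample(demo_batch_size)
    demo_loss = F.cross_entropy(policy(demo_obs), demo_actions)

    # Mixing weight lambda_demo controls how strongly demonstrations influence the update.
    return pg_loss + lambda_demo * demo_loss
```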
