Deep Decentralized Multi-task Multi-Agent Reinforcement Learning under Partial Observability

Many real-world tasks involve multiple agents with partial observability and limited communication. Learning is challenging in these settings due to local viewpoints of agents, which perceive the world as non-stationary due to concurrently-exploring teammates. Approaches that learn specialized policies for individual tasks face problems when applied to the real world: not only do agents have to learn and store distinct policies for each task, but in practice identities of tasks are often non-observable, making these approaches inapplicable. This paper formalizes and addresses the problem of multi-task multi-agent reinforcement learning under partial observability. We introduce a decentralized single-task learning approach that is robust to concurrent interactions of teammates, and present an approach for distilling single-task policies into a unified policy that performs well across multiple related tasks, without explicit provision of task identity.

[1]  Geoffrey J. Gordon Stable Function Approximation in Dynamic Programming , 1995, ICML.

[2]  Thomas G. Dietterich What is machine learning? , 2020, Archives of Disease in Childhood.

[3]  Jürgen Schmidhuber,et al.  Long Short-Term Memory , 1997, Neural Computation.

[4]  Leslie Pack Kaelbling,et al.  Planning and Acting in Partially Observable Stochastic Domains , 1998, Artif. Intell..

[5]  Craig Boutilier,et al.  The Dynamics of Reinforcement Learning in Cooperative Multiagent Systems , 1998, AAAI/IAAI.

[6]  Rich Caruana,et al.  Multitask Learning , 1998, Encyclopedia of Machine Learning and Data Mining.

[7]  Richard S. Sutton,et al.  Introduction to Reinforcement Learning , 1998 .

[8]  Neil Immerman,et al.  The Complexity of Decentralized Control of Markov Decision Processes , 2000, UAI.

[9]  Kee-Eung Kim,et al.  Learning to Cooperate via Policy Search , 2000, UAI.

[10]  Martin Lauer,et al.  An Algorithm for Distributed Reinforcement Learning in Cooperative Multi-Agent Systems , 2000, ICML.

[11]  Olivier Buffet,et al.  Multi-Agent Systems by Incremental Gradient Reinforcement Learning , 2001, IJCAI.

[12]  Manuela M. Veloso,et al.  Multiagent learning using a variable learning rate , 2002, Artif. Intell..

[13]  Daniel Kudenko,et al.  Reinforcement learning of coordination in cooperative multi-agent systems , 2002, AAAI/IAAI.

[14]  Masayuki Yamamura,et al.  Multitask reinforcement learning on the distribution of MDPs , 2003, Proceedings 2003 IEEE International Symposium on Computational Intelligence in Robotics and Automation. Computational Intelligence in Robotics and Automation for the New Millennium (Cat. No.03EX694).

[15]  Tanaka Fumihide,et al.  Multitask Reinforcement Learning on the Distribution of MDPs , 2003 .

[16]  Long Ji Lin,et al.  Self-improving reactive agents based on reinforcement learning, planning and teaching , 1992, Machine Learning.

[17]  Manuela M. Veloso,et al.  Probabilistic policy reuse in a reinforcement learning agent , 2006, AAMAS '06.

[18]  Jürgen Schmidhuber,et al.  Solving Deep Memory POMDPs with Recurrent Policy Gradients , 2007, ICANN.

[19]  Alan Fern,et al.  Multi-task reinforcement learning: a hierarchical Bayesian approach , 2007, ICML '07.

[20]  Dan Ventura,et al.  Predicting and Preventing Coordination Problems in Cooperative Q-learning Systems , 2007, IJCAI.

[21]  Guillaume J. Laurent,et al.  Hysteretic q-learning :an algorithm for decentralized reinforcement learning in cooperative multi-agent teams , 2007, 2007 IEEE/RSJ International Conference on Intelligent Robots and Systems.

[22]  Nikos A. Vlassis,et al.  Optimal and Approximate Q-value Functions for Decentralized POMDPs , 2008, J. Artif. Intell. Res..

[23]  Alan Fern,et al.  Learning and transferring roles in multi-agent MDPs , 2008, AAAI 2008.

[24]  Alan Fern,et al.  Learning and Transferring Roles in Multi-Agent Reinforcement , 2008 .

[25]  Peter Stone,et al.  Transfer Learning for Reinforcement Learning Domains: A Survey , 2009, J. Mach. Learn. Res..

[26]  Shlomo Zilberstein,et al.  Incremental Policy Generation for Finite-Horizon DEC-POMDPs , 2009, ICAPS.

[27]  Feng Wu,et al.  Rollout Sampling Policy Iteration for Decentralized POMDPs , 2010, UAI.

[28]  Qiang Yang,et al.  A Survey on Transfer Learning , 2010, IEEE Transactions on Knowledge and Data Engineering.

[29]  Bart De Schutter,et al.  Multi-agent Reinforcement Learning: An Overview , 2010 .

[30]  Lakhmi C. Jain,et al.  Innovations in Multi-Agent Systems and Applications - 1 , 2010 .

[31]  N. Le Fort-Piat,et al.  The world of independent learners is not markovian , 2011, Int. J. Knowl. Based Intell. Eng. Syst..

[32]  Bikramjit Banerjee,et al.  Sample Bounded Distributed Reinforcement Learning for Decentralized POMDPs , 2012, AAAI.

[33]  Wei Zhang,et al.  Multiagent-Based Reinforcement Learning for Optimal Reactive Power Dispatch , 2012, IEEE Transactions on Systems, Man, and Cybernetics, Part C (Applications and Reviews).

[34]  Yoshua Bengio,et al.  Practical Recommendations for Gradient-Based Training of Deep Architectures , 2012, Neural Networks: Tricks of the Trade.

[35]  Guillaume J. Laurent,et al.  Independent reinforcement learners in cooperative Markov games: a survey regarding coordination problems , 2012, The Knowledge Engineering Review.

[36]  Feng Wu,et al.  Monte-Carlo Expectation Maximization for Decentralized POMDPs , 2013, IJCAI.

[37]  Siobhán Clarke,et al.  Transfer learning in multi-agent systems through parallel transfer , 2013 .

[38]  Lihong Li,et al.  Sample Complexity of Multi-task Reinforcement Learning , 2013, UAI.

[39]  Panagiotis Tzionas,et al.  A robust approach for multi-agent natural resource allocation based on stochastic optimization algorithms , 2014, Appl. Soft Comput..

[40]  Geoffrey E. Hinton,et al.  Distilling the Knowledge in a Neural Network , 2015, ArXiv.

[41]  Jonathan P. How,et al.  Stick-Breaking Policy Learning in Dec-POMDPs , 2015, IJCAI.

[42]  Jimmy Ba,et al.  Adam: A Method for Stochastic Optimization , 2014, ICLR.

[43]  Shane Legg,et al.  Human-level control through deep reinforcement learning , 2015, Nature.

[44]  Peter Stone,et al.  Deep Recurrent Q-Learning for Partially Observable MDPs , 2015, AAAI Fall Symposia.

[45]  Razvan Pascanu,et al.  Policy Distillation , 2015, ICLR.

[46]  Shimon Whiteson,et al.  Learning to Communicate to Solve Riddles with Deep Distributed Recurrent Q-Networks , 2016, ArXiv.

[47]  Frans A. Oliehoek,et al.  A Concise Introduction to Decentralized POMDPs , 2016, SpringerBriefs in Intelligent Systems.

[48]  Jonathan P. How,et al.  Learning for Decentralized Control of Multiagent Systems in Large, Partially-Observable Stochastic Environments , 2016, AAAI.

[49]  Shimon Whiteson,et al.  Stabilising Experience Replay for Deep Multi-Agent Reinforcement Learning , 2017, ICML.