Learning Hierarchical Teaching in Cooperative Multiagent Reinforcement Learning

Heterogeneous knowledge naturally arises among different agents in cooperative multiagent reinforcement learning. As such, learning can be greatly improved if agents can effectively pass their knowledge on to other agents. Existing work has demonstrated that peer-to-peer knowledge transfer, a process referred to as action advising, improves team-wide learning. In contrast to previous frameworks that advise at the level of primitive actions, we aim to learn high-level teaching policies that decide when and what high-level action (e.g., a sub-goal) to advise a teammate. We introduce a new learning-to-teach framework, called hierarchical multiagent teaching (HMAT). The proposed framework addresses the difficulties prior multiagent-teaching approaches face in domains with long horizons, delayed rewards, and continuous state/action spaces by leveraging temporal abstraction and deep function approximation. Our empirical evaluations show that HMAT accelerates team-wide learning progress in environments more complex than those explored in previous work. HMAT also learns teaching policies that transfer to different teammates and tasks, and it can even teach teammates with heterogeneous action spaces.
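To make the when/what advising interface concrete, below is a minimal, hypothetical sketch of a single high-level advising step. All names here (TeacherPolicy, decide, env_step, the toy environment, and the constant K) are our own illustrative assumptions, not the paper's implementation: a teaching policy decides whether to intervene and, if so, emits a sub-goal; the student's low-level policy then pursues the active sub-goal for K primitive steps, which is the temporal abstraction that shortens the effective advising horizon.

```python
import random

# Hypothetical sketch of one HMAT-style advising step (names are ours,
# not from the paper). The teacher decides *when* to advise (intervene
# or stay silent) and *what* to advise (a high-level sub-goal); the
# student's low-level policy then pursues that sub-goal for K steps.

K = 5  # primitive steps executed per high-level decision


class TeacherPolicy:
    """Stand-in for a learned teaching policy (e.g., a deep network)."""

    def decide(self, teacher_state, student_state):
        # Placeholder logic: advise with some probability. A trained
        # policy would condition this on both agents' observations.
        if random.random() < 0.5:
            return None                                   # "when": silent
        return [random.uniform(-1, 1) for _ in range(2)]  # "what": sub-goal


def student_subgoal(student_state):
    """Sub-goal the student's own high-level policy would pick."""
    return [0.0, 0.0]


def low_level_action(student_state, subgoal):
    """Student's low-level policy, conditioned on the active sub-goal."""
    return [g - s for g, s in zip(subgoal, student_state)]


def env_step(state, action):
    """Toy environment: drift toward the action; reward nearer the origin."""
    state = [s + 0.1 * a for s, a in zip(state, action)]
    reward = -sum(abs(s) for s in state)
    return state, reward


# One high-level step: the teacher may override the student's own sub-goal,
# and the chosen sub-goal is held fixed for K primitive steps.
teacher = TeacherPolicy()
s_teacher, s_student = [0.5, -0.5], [1.0, 1.0]
advice = teacher.decide(s_teacher, s_student)
subgoal = advice if advice is not None else student_subgoal(s_student)
ret = 0.0
for _ in range(K):
    a = low_level_action(s_student, subgoal)
    s_student, r = env_step(s_student, a)
    ret += r
print(f"advised={advice is not None}, {K}-step return={ret:.2f}")
```

In the paper the teaching policy is itself learned rather than hand-coded as above; the sketch fixes only the structural idea of a when/what decision operating over temporally extended sub-goals.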
