Meta Reinforcement Learning with Autonomous Inference of Subtask Dependencies

We propose and address a novel few-shot RL problem in which a task is characterized by a subtask graph describing a set of subtasks and their dependencies, both of which are unknown to the agent. The agent must quickly adapt to the task over a few episodes during the adaptation phase in order to maximize the return in the test phase. Instead of directly learning a meta-policy, we develop a Meta-learner with Subtask Graph Inference (MSGI), which infers the latent parameter of the task by interacting with the environment and maximizes the return given the inferred latent parameter. To facilitate learning, we adopt an intrinsic reward inspired by the upper confidence bound (UCB) that encourages efficient exploration. Our experimental results on two grid-world domains and StarCraft II environments show that the proposed method accurately infers the latent task parameter and adapts more efficiently than existing meta-RL and hierarchical RL methods.
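The UCB-inspired intrinsic reward can be pictured as a count-based exploration bonus over subtasks: subtasks executed less often during the adaptation phase receive a larger bonus, pushing the agent to gather the experience needed to infer the subtask graph. The sketch below is a minimal, hypothetical illustration of such a bonus; the class name, `scale` parameter, and exact bonus form are assumptions for exposition, not the paper's exact formulation.

```python
import numpy as np


class UCBExplorationBonus:
    """Count-based exploration bonus in the spirit of UCB (illustrative sketch).

    Subtasks that have been executed less often during adaptation receive a
    larger intrinsic reward, encouraging the agent to try them.
    """

    def __init__(self, num_subtasks: int, scale: float = 1.0):
        self.counts = np.zeros(num_subtasks)  # executions per subtask
        self.total = 0                        # total subtask executions
        self.scale = scale                    # bonus weight (assumed hyperparameter)

    def bonus(self, subtask: int) -> float:
        # UCB-style bonus: sqrt(log(total) / count). Unvisited subtasks get a
        # very large bonus so they are tried at least once.
        if self.counts[subtask] == 0:
            return self.scale * 1e6
        return self.scale * np.sqrt(np.log(self.total + 1) / self.counts[subtask])

    def update(self, subtask: int) -> None:
        # Record one execution of the subtask during the adaptation phase.
        self.counts[subtask] += 1
        self.total += 1


# Example usage: the intrinsic reward would be added to the environment reward
# when the adaptation policy completes a subtask.
bonus_fn = UCBExplorationBonus(num_subtasks=4)
bonus_fn.update(subtask=0)
print(bonus_fn.bonus(0), bonus_fn.bonus(1))
```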
