Meta Learning Shared Hierarchies

We develop a metalearning approach for learning hierarchically structured policies, improving sample efficiency on unseen tasks through the use of shared primitives---policies that are executed for large numbers of timesteps. Specifically, a set of primitives is shared across a distribution of tasks, and task-specific policies switch between them. We provide a concrete metric for measuring the strength of such hierarchies, leading to an optimization problem for quickly reaching high reward on unseen tasks. We then present an algorithm to solve this problem end-to-end through the use of any off-the-shelf reinforcement learning method, by repeatedly sampling new tasks and resetting task-specific policies. We successfully discover meaningful motor primitives for the directional movement of four-legged robots, solely by interacting with distributions of mazes. We also demonstrate the transferability of primitives to solve long-timescale sparse-reward obstacle courses, and we enable 3D humanoid robots to robustly walk and crawl with the same policy.
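The training loop described above can be illustrated with a minimal sketch. The toy setup below is an assumption made for illustration only: the 1-D navigation tasks, the tabular softmax policies, the plain REINFORCE updates, and all names and hyperparameters (K, N, WARMUP_EPISODES, and so on) are hypothetical and are not the paper's actual setup, which uses MuJoCo robots and an off-the-shelf policy-gradient method. What the sketch preserves is the structure: shared primitive parameters persist across tasks, while a task-specific master policy that selects a primitive every N timesteps is reset for each newly sampled task, adapted alone during a warm-up phase, and then trained jointly with the primitives.

# Minimal sketch of the described meta-training loop (illustrative toy setup, not the paper's implementation)
import numpy as np

rng = np.random.default_rng(0)

K = 2                  # number of shared primitives (sub-policies)
N = 5                  # timesteps each selected primitive runs before the master acts again
EPISODE_LEN = 30
WARMUP_EPISODES = 40   # per task: update only the freshly reset master policy
JOINT_EPISODES = 40    # per task: update master and shared primitives together
LR = 0.5

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

# Shared primitive parameters: logits over the low-level actions {left, right}.
# These persist across tasks; they are what the meta-training loop optimizes.
primitive_logits = np.zeros((K, 2))

def run_episode(goal, master_logits, update_primitives):
    # One REINFORCE episode on a single task (a goal position on a line).
    global primitive_logits
    pos, ep_return = 0.0, 0.0
    master_grad = np.zeros_like(master_logits)
    prim_grad = np.zeros_like(primitive_logits)
    k = 0
    for t in range(EPISODE_LEN):
        if t % N == 0:                      # master policy acts only every N timesteps
            mp = softmax(master_logits)
            k = rng.choice(K, p=mp)
            master_grad += np.eye(K)[k] - mp
        ap = softmax(primitive_logits[k])   # selected primitive picks a low-level action
        a = rng.choice(2, p=ap)
        prim_grad[k] += np.eye(2)[a] - ap
        pos += -1.0 if a == 0 else 1.0
        ep_return += -abs(goal - pos)       # dense reward: negative distance to goal
    # Plain REINFORCE with a crude constant baseline.
    adv = (ep_return + EPISODE_LEN * abs(goal)) / EPISODE_LEN
    master_logits += LR * adv * master_grad
    if update_primitives:
        primitive_logits += LR * adv * prim_grad
    return ep_return

for task in range(200):
    goal = rng.choice([-10.0, 10.0])   # sample a new task from the distribution
    master_logits = np.zeros(K)        # reset the task-specific master policy
    for _ in range(WARMUP_EPISODES):   # warm-up: adapt only the master to the new task
        run_episode(goal, master_logits, update_primitives=False)
    for _ in range(JOINT_EPISODES):    # joint phase: also improve the shared primitives
        run_episode(goal, master_logits, update_primitives=True)

print("learned primitive action distributions:",
      softmax(primitive_logits[0]), softmax(primitive_logits[1]))

In this sketch the two primitives tend to separate into "move left" and "move right" behaviors, because each freshly reset master quickly latches onto whichever primitive already suits the sampled goal, and the joint phase then reinforces that specialization; this is the same mechanism the paper relies on at a much larger scale.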
