Provable Hierarchy-Based Meta-Reinforcement Learning

Hierarchical reinforcement learning (HRL) has seen widespread interest as an approach to tractable learning of complex modular behaviors. However, existing work either assumes access to expert-constructed hierarchies or uses hierarchy-learning heuristics with no provable guarantees. To address this gap, we analyze HRL in the meta-RL setting, where a learner learns latent hierarchical structure during meta-training for use in a downstream task. We consider a tabular setting in which natural hierarchical structure is embedded in the transition dynamics. Analogous to supervised meta-learning theory, we provide “diversity conditions” which, together with a tractable optimism-based algorithm, guarantee sample-efficient recovery of this natural hierarchy. Furthermore, we provide regret bounds for a learner that uses the recovered hierarchy to solve a meta-test task. Our bounds incorporate common notions from the HRL literature, such as temporal and state/action abstractions, suggesting that our setting and analysis capture important features of HRL in practice.
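The paper's own algorithm is not reproduced here; as a minimal illustration of what an "optimism-based algorithm" in a tabular, finite-horizon setting typically looks like, the sketch below implements a generic UCBVI-style optimistic planning step. All names, shapes, and the Hoeffding-style bonus are illustrative assumptions, not the method analyzed in the paper.

```python
import numpy as np

def optimistic_value_iteration(counts, rewards, H, nS, nA, delta=0.05):
    """Illustrative UCBVI-style optimistic planning step (not the paper's algorithm).

    counts[s, a, s'] -- visit counts for transition (s, a) -> s'
    rewards[s, a]    -- empirical mean reward for (s, a), assumed in [0, 1]
    Returns an optimistic Q-table of shape (H, nS, nA).
    """
    n_sa = counts.sum(axis=2)                        # visits to each (s, a)
    p_hat = counts / np.maximum(n_sa, 1)[..., None]  # empirical transition model
    Q = np.zeros((H + 1, nS, nA))
    V = np.zeros((H + 1, nS))
    for h in reversed(range(H)):
        # Hoeffding-style exploration bonus; unvisited pairs get the maximal value H
        bonus = H * np.sqrt(np.log(nS * nA * H / delta) / np.maximum(n_sa, 1))
        Q[h] = np.minimum(rewards + p_hat @ V[h + 1] + bonus, H)
        V[h] = Q[h].max(axis=1)
    return Q[:H]
```

Acting greedily with respect to the returned optimistic Q-values and updating `counts` and `rewards` after each episode is the standard loop that yields minimax regret bounds in tabular RL; the paper's contribution is to extend this style of analysis to hierarchy recovery and reuse, which the sketch does not capture.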
