Learning Abstract Options

Building systems that autonomously create temporal abstractions from data is a key challenge in scaling learning and planning in reinforcement learning. One popular approach for addressing this challenge is the options framework (Sutton et al., 1999). However, only recently, in (Bacon et al., 2017), was a policy gradient theorem derived for online learning of general-purpose options in an end-to-end fashion. In this work, we extend previous work on this topic, which focuses only on learning a two-level hierarchy of options and primitive actions, to enable learning simultaneously at multiple resolutions in time. We achieve this by considering an arbitrarily deep hierarchy of options in which high-level, temporally extended options are composed of lower-level options with finer resolutions in time. We extend results from (Bacon et al., 2017) and derive policy gradient theorems for a deep hierarchy of options. Our proposed hierarchical option-critic architecture is capable of learning internal policies, termination conditions, and hierarchical compositions over options without the need for any intrinsic rewards or subgoals. Our empirical results in both discrete and continuous environments demonstrate the efficacy of our framework.
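To make the control flow of such a hierarchy concrete, the sketch below illustrates how options at each level are selected top-down and terminated in a bottom-up cascade. It is a minimal, hypothetical illustration: tabular random policies and fixed termination probabilities stand in for the learned neural policies and termination functions of the actual architecture, and all class and function names are illustrative rather than taken from the paper.

import numpy as np

rng = np.random.default_rng(0)

class OptionLevel:
    """One level of the hierarchy: a tabular policy over this level's options,
    conditioned on the parent option, plus a termination probability per option.
    These are stand-ins for the learned neural components."""
    def __init__(self, n_parents, n_options, term_prob=0.2):
        self.pi = rng.dirichlet(np.ones(n_options), size=n_parents)
        self.beta = np.full(n_options, term_prob)

    def select(self, parent, state):
        # Sample an option for this level given the active parent option.
        return int(rng.choice(len(self.pi[parent]), p=self.pi[parent]))

    def terminates(self, option, state):
        # Bernoulli termination; a learned beta(state, option) in the real model.
        return rng.random() < self.beta[option]

def hierarchical_step(levels, action_policy, active, state):
    # 1) Terminations cascade bottom-up: an option at a given level can only
    #    end once every option beneath it has already ended.
    cutoff = len(levels)
    for lvl in reversed(range(len(levels))):
        if levels[lvl].terminates(active[lvl], state):
            cutoff = lvl
        else:
            break
    # 2) Re-select the terminated options top-down, each level conditioning
    #    on the (possibly new) option chosen one level above it.
    for lvl in range(cutoff, len(levels)):
        parent = active[lvl - 1] if lvl > 0 else 0
        active[lvl] = levels[lvl].select(parent, state)
    # 3) The deepest active option conditions the primitive-action policy.
    return action_policy.select(active[-1], state), active

# Usage: a three-level agent (two option levels over six primitive actions).
levels = [OptionLevel(n_parents=1, n_options=4), OptionLevel(n_parents=4, n_options=4)]
action_policy = OptionLevel(n_parents=4, n_options=6)
state = None  # placeholder; any state representation would be passed through
active = [levels[0].select(0, state)]
active.append(levels[1].select(active[0], state))
for t in range(10):
    action, active = hierarchical_step(levels, action_policy, active, state)

The point of the sketch is only the selection and termination structure: when no option terminates, every level continues executing its current option while the lowest-level policy keeps emitting primitive actions; in the full architecture, each level's policy and termination function would instead be trained with the hierarchical policy gradient theorems derived in the paper.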

[1] M. Franceschini, et al. Generative Knowledge Distillation for General Purpose Function Compression, 2017.

[2] Scott Niekum, et al. Semantically Grounded Learning from Unstructured Demonstrations, 2013.

[3] Shane Legg, et al. Human-level control through deep reinforcement learning, 2015, Nature.

[4] Marc G. Bellemare, et al. The Arcade Learning Environment: An Evaluation Platform for General Agents (Extended Abstract), 2012, IJCAI.

[5] David Silver, et al. Compositional Planning Using Optimal Option Models, 2012, ICML.

[6] Marlos C. Machado, et al. Revisiting the Arcade Learning Environment: Evaluation Protocols and Open Problems for General Agents, 2017, J. Artif. Intell. Res.

[7] Shie Mannor, et al. Q-Cut - Dynamic Discovery of Sub-goals in Reinforcement Learning, 2002, ECML.

[8] Jan Peters, et al. Probabilistic inference for determining options in reinforcement learning, 2016, Machine Learning.

[9] Kate Saenko, et al. Learning Multi-Level Hierarchies with Hindsight, 2017, ICLR.

[10] Ion Stoica, et al. Multi-Level Discovery of Deep Options, 2017, ArXiv.

[11] Tom Schaul, et al. FeUdal Networks for Hierarchical Reinforcement Learning, 2017, ICML.

[12] Martin L. Puterman, et al. Markov Decision Processes: Discrete Stochastic Dynamic Programming, 1994.

[13] Nahum Shimkin, et al. Unified Inter and Intra Options Learning Using Policy Gradient Methods, 2011, EWRL.

[14] Yishay Mansour, et al. Policy Gradient Methods for Reinforcement Learning with Function Approximation, 1999, NIPS.

[15] Rich Caruana, et al. Multitask Learning, 1997, Machine Learning.

[16] Doina Precup, et al. Between MDPs and Semi-MDPs: A Framework for Temporal Abstraction in Reinforcement Learning, 1999, Artif. Intell.

[17] Doina Precup, et al. When Waiting is not an Option: Learning Options with a Deliberation Cost, 2017, AAAI.

[18] Doina Precup, et al. Learning Options in Reinforcement Learning, 2002, SARA.

[19] Doina Precup, et al. Learnings Options End-to-End for Continuous Action Tasks, 2017, ArXiv.

[20] John N. Tsitsiklis, et al. Actor-Critic Algorithms, 1999, NIPS.

[21] Sophia Krasikov, et al. A Deep Learning and Knowledge Transfer Based Architecture for Social Media User Characteristic Determination, 2015, SocialNLP@NAACL.

[22] Alex Graves, et al. Strategic Attentive Writer for Learning Macro-Actions, 2016, NIPS.

[23] Andrew G. Barto, et al. Skill Characterization Based on Betweenness, 2008, NIPS.

[24] Chrisantha Fernando, et al. PathNet: Evolution Channels Gradient Descent in Super Neural Networks, 2017, ArXiv.

[25] Balaraman Ravindran, et al. Learning to Multi-Task by Active Sampling, 2017, ICLR.

[26] Razvan Pascanu, et al. Policy Distillation, 2015, ICLR.

[27] Kate Saenko, et al. Hierarchical Actor-Critic, 2017, ArXiv.

[28] Shie Mannor, et al. Approximate Value Iteration with Temporally Extended Actions, 2015, J. Artif. Intell. Res.

[29] Scott Kuindersma, et al. Autonomous Skill Acquisition on a Mobile Manipulator, 2011, AAAI.

[30] Alex Graves, et al. Automated Curriculum Learning for Neural Networks, 2017, ICML.

[31] Andrew G. Barto, et al. Automatic Discovery of Subgoals in Reinforcement Learning using Diverse Density, 2001, ICML.

[32] Matthew Riemer, et al. Routing Networks: Adaptive Selection of Non-linear Functions for Multi-Task Learning, 2017, ICLR.

[33] Doina Precup, et al. The Option-Critic Architecture, 2016, AAAI.

[34] Doina Precup, et al. Temporal abstraction in reinforcement learning, 2000, ICML.

[35] Martial Hebert, et al. Cross-Stitch Networks for Multi-task Learning, 2016, CVPR.

[36] Alex Graves, et al. Asynchronous Methods for Deep Reinforcement Learning, 2016, ICML.

[37] M. Riemer, et al. Representation Stability as a Regularizer for Improved Text Analytics Transfer Learning, 2017, ArXiv.

[38] Razvan Pascanu, et al. Overcoming catastrophic forgetting in neural networks, 2016, Proceedings of the National Academy of Sciences.

[39] Joshua B. Tenenbaum, et al. Hierarchical Deep Reinforcement Learning: Integrating Temporal Abstraction and Intrinsic Motivation, 2016, NIPS.

[40] Geoffrey E. Hinton, et al. Feudal Reinforcement Learning, 1992, NIPS.

[41] Richard Socher, et al. Hierarchical and Interpretable Skill Acquisition in Multi-task Reinforcement Learning, 2017, ICLR.

[42] Shie Mannor, et al. Adaptive Skills Adaptive Partitions (ASAP), 2016, NIPS.

[43] Saurabh Kumar, et al. Learning to Compose Skills, 2017, ArXiv.