On the Role of Weight Sharing During Deep Option Learning

The options framework is a popular approach for building temporally extended actions in reinforcement learning. In particular, the option-critic architecture provides general-purpose policy gradient theorems for learning temporally extended actions from scratch. However, past work makes the key assumption that each component of option-critic has independent parameters. In this work we note that while this assumption behind the option-critic policy gradient theorems holds in the tabular case, it is always violated in practice in the deep function approximation setting. We therefore reconsider this assumption and consider more general extensions of option-critic and hierarchical option-critic training that optimize the full architecture with each update. Dropping the parameter independence assumption challenges a belief in prior work that training the policy over options can be disentangled from the dynamics of the underlying options. In fact, learning can be sped up by focusing the policy over options on states where options are actually likely to terminate. We test our new algorithms on sample-efficient learning of Atari games, and demonstrate significantly improved stability and faster convergence when learning long options.
