Compositional Transfer in Hierarchical Reinforcement Learning

The successful application of general reinforcement learning algorithms to real-world robotics is often limited by their high data requirements. We introduce Regularized Hierarchical Policy Optimization (RHPO) to improve data efficiency in domains with multiple dominant tasks and ultimately reduce the required platform time. To this end, we employ compositional inductive biases on multiple levels, together with corresponding mechanisms for sharing off-policy transition data across low-level controllers and tasks, as well as for scheduling tasks. The presented algorithm enables stable and fast learning in complex, real-world domains, both for parallel multitask learning and for sequential transfer. We show that the investigated types of hierarchy enable positive transfer while partially mitigating negative interference, and we evaluate the benefits of additional incentives for efficient, compositional task solutions in single-task domains. Finally, we demonstrate substantial gains in data efficiency and final performance over competitive baselines in a week-long stacking experiment on a physical robot.
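To make the compositional structure described above concrete, the sketch below shows one way a hierarchical multitask policy could be organized: a set of low-level Gaussian components shared across all tasks, gated by a task-conditioned high-level categorical controller. This is a minimal illustration under assumptions, not the paper's implementation; the class name, component count, network sizes, and the one-hot task conditioning are all hypothetical choices made for clarity.

```python
# Illustrative sketch (assumptions, not the paper's code): a task-conditioned
# mixture over low-level Gaussian components that are shared across tasks.
import torch
import torch.nn as nn
import torch.distributions as D


class HierarchicalMixturePolicy(nn.Module):
    def __init__(self, obs_dim, act_dim, num_tasks, num_components=4, hidden=256):
        super().__init__()
        self.act_dim = act_dim
        # Low-level components: shared across all tasks, enabling reuse of
        # transition data and behavior between tasks (compositional transfer).
        self.components = nn.ModuleList([
            nn.Sequential(
                nn.Linear(obs_dim, hidden), nn.ReLU(),
                nn.Linear(hidden, 2 * act_dim),  # per-dimension mean and log-std
            )
            for _ in range(num_components)
        ])
        # High-level controller: categorical over components, conditioned on the
        # observation and a one-hot task identifier (an assumed encoding).
        self.gate = nn.Sequential(
            nn.Linear(obs_dim + num_tasks, hidden), nn.ReLU(),
            nn.Linear(hidden, num_components),
        )

    def forward(self, obs, task_onehot):
        # Mixture weights pi(k | s, task).
        logits = self.gate(torch.cat([obs, task_onehot], dim=-1))
        means, stds = [], []
        for comp in self.components:
            mean, log_std = comp(obs).split(self.act_dim, dim=-1)
            means.append(mean)
            stds.append(log_std.clamp(-5.0, 2.0).exp())
        mixture = D.Categorical(logits=logits)
        component_dists = D.Independent(
            D.Normal(torch.stack(means, dim=-2), torch.stack(stds, dim=-2)), 1
        )
        # Marginal action distribution: sum_k pi(k | s, task) * pi_k(a | s).
        return D.MixtureSameFamily(mixture, component_dists)


# Hypothetical usage: sample an action for task 0 in a 2-task setup.
policy = HierarchicalMixturePolicy(obs_dim=10, act_dim=4, num_tasks=2)
obs = torch.randn(1, 10)
task = torch.tensor([[1.0, 0.0]])
action = policy(obs, task).sample()
```

Only the high-level gate needs to differ between tasks in such a design, which is one way the shared low-level components can support positive transfer while limiting interference.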
