Solving Compositional Reinforcement Learning Problems via Task Reduction

We propose a novel learning paradigm, Self-Imitation via Reduction (SIR), for solving compositional reinforcement learning problems. SIR is based on two core ideas: task reduction and self-imitation. Task reduction tackles a hard-to-solve task by actively reducing it to an easier task whose solution is already known to the RL agent. Once the original hard task has been solved through task reduction, the agent obtains a self-generated solution trajectory to imitate. By continually collecting and imitating such demonstrations, the agent progressively expands the solved subspace of the entire task space. Experimental results show that SIR significantly accelerates and improves learning on a variety of challenging sparse-reward continuous-control problems with compositional structure. Code and videos are available at
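The task-reduction step described above can be illustrated with a minimal sketch. It is not the authors' implementation: the 1-D state space, the `toy_value` function, the candidate list, and the product rule used to score an intermediate target are all assumptions made for illustration. The idea shown is only that a hard task (start to goal) is replaced by two easier legs through an intermediate target that the agent's value estimate rates as solvable.

```python
from typing import Callable, Sequence, Tuple

State = float  # hypothetical 1-D state/goal, for illustration only


def reduce_task(
    start: State,
    goal: State,
    candidates: Sequence[State],
    value: Callable[[State, State], float],
) -> Tuple[State, float]:
    """Pick the intermediate target that best splits the task start -> goal.

    Assumption: value(s, g) estimates, in [0, 1], how likely the agent is
    to reach g from s. Each candidate is scored by the product of the
    values of its two legs; this composition rule is an illustrative choice.
    """
    def score(mid: State) -> float:
        return value(start, mid) * value(mid, goal)

    best = max(candidates, key=score)
    return best, score(best)


def toy_value(s: State, g: State) -> float:
    """Toy estimate: tasks within distance 1 count as solved, others do not."""
    return 1.0 if abs(g - s) <= 1.0 else 0.0


if __name__ == "__main__":
    start, goal = 0.0, 2.0
    print("direct value:", toy_value(start, goal))
    mid, composite = reduce_task(start, goal, [0.5, 1.0, 3.0], toy_value)
    print("intermediate target:", mid, "composite value:", composite)
```

In a full loop, a successful rollout through the chosen intermediate target would be stored as a demonstration for the original task and imitated in later updates, which is the self-imitation half of the idea.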
