Goal-Conditioned Reinforcement Learning with Imagined Subgoals

Goal-conditioned reinforcement learning endows an agent with a large variety of skills, but the agent often struggles to solve tasks that require temporally extended reasoning. In this work, we propose to incorporate imagined subgoals into policy learning to facilitate the learning of complex tasks. Imagined subgoals are predicted by a separate high-level policy, which is trained simultaneously with the policy and its critic. The high-level policy predicts intermediate states halfway to the goal, using the value function as a reachability metric. We do not require the policy to reach these subgoals explicitly. Instead, we use them to define a prior policy and incorporate this prior into a KL-constrained policy iteration scheme to speed up and regularize learning. Imagined subgoals are used during policy learning, but not at test time, where we apply only the learned policy. We evaluate our approach on complex robotic navigation and manipulation tasks and show that it outperforms existing methods by a large margin.
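To make the learning scheme concrete, below is a minimal PyTorch sketch of the KL-regularized actor update described above. It assumes goals are desired states, so the prior is simply the same policy conditioned on an imagined subgoal sampled from the high-level policy, and the update maximizes the critic while penalizing KL(pi(.|s, g) || pi(.|s, s_g)). All names here (GaussianPolicy, high_level, q_net, alpha, ...) are illustrative assumptions, not the authors' implementation.

import torch
import torch.nn as nn
from torch.distributions import Normal, kl_divergence

STATE_DIM, ACTION_DIM = 8, 2  # goals are target states, so goal dim == state dim

class GaussianPolicy(nn.Module):
    # A conditional diagonal Gaussian: used both as pi(a | s, g) and,
    # with out_dim = STATE_DIM, as the high-level policy pi_H(s_g | s, g).
    def __init__(self, in_dim, out_dim, hidden=64):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, 2 * out_dim))

    def forward(self, s, g):
        mu, log_std = self.net(torch.cat([s, g], dim=-1)).chunk(2, dim=-1)
        return Normal(mu, log_std.clamp(-5.0, 2.0).exp())

policy = GaussianPolicy(2 * STATE_DIM, ACTION_DIM)     # pi(a | s, g)
high_level = GaussianPolicy(2 * STATE_DIM, STATE_DIM)  # pi_H(s_g | s, g)
q_net = nn.Sequential(nn.Linear(2 * STATE_DIM + ACTION_DIM, 64), nn.ReLU(),
                      nn.Linear(64, 1))                # Q(s, a, g)
optimizer = torch.optim.Adam(policy.parameters(), lr=3e-4)
alpha = 0.1  # KL weight; an assumed hyperparameter

def actor_update(s, g):
    # Imagine a subgoal and build the prior; both are held fixed
    # (no gradient) during this policy update.
    with torch.no_grad():
        subgoal = high_level(s, g).sample()
        prior = policy(s, subgoal)  # same policy, conditioned on the subgoal
    dist = policy(s, g)
    action = dist.rsample()  # reparameterized sample for pathwise gradients
    q = q_net(torch.cat([s, action, g], dim=-1)).squeeze(-1)
    kl = kl_divergence(dist, prior).sum(-1)  # KL(pi(.|s,g) || pi(.|s,s_g))
    loss = (alpha * kl - q).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Usage on a dummy batch of goal-conditioned transitions:
s, g = torch.randn(32, STATE_DIM), torch.randn(32, STATE_DIM)
print(actor_update(s, g))

In the full method, the high-level policy is itself trained, using the value function to score candidate subgoals as reachable and halfway to the goal; that update, along with the critic update, is omitted from this sketch for brevity.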
