Relay Policy Learning: Solving Long-Horizon Tasks via Imitation and Reinforcement Learning

We present relay policy learning, a method for imitation and reinforcement learning that can solve multi-stage, long-horizon robotic tasks. This general and universally applicable two-phase approach consists of an imitation learning stage that produces goal-conditioned hierarchical policies, and a reinforcement learning phase that fine-tunes these policies for task performance. Our method, while not necessarily perfect at imitation learning, is very amenable to further improvement via environment interaction, allowing it to scale to challenging long-horizon tasks. We simplify the long-horizon policy learning problem with a novel data-relabeling algorithm for learning goal-conditioned hierarchical policies, in which the low-level policy acts only for a fixed number of steps, regardless of the goal achieved. While we rely on demonstration data to bootstrap policy learning, we do not assume access to demonstrations of every specific task being solved; instead, we leverage unstructured and unsegmented demonstrations of semantically meaningful behaviors, which are not only less burdensome to provide but can also greatly facilitate further improvement using reinforcement learning. We demonstrate the effectiveness of our method on a number of multi-stage, long-horizon manipulation tasks in a challenging kitchen simulation environment. Videos are available at this https URL.
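To make the data-relabeling idea concrete, below is a minimal Python sketch of relay-style goal relabeling on a single unsegmented demonstration. The function name and the window sizes `low_horizon` and `high_horizon` are illustrative assumptions, not the authors' exact implementation: the low-level policy is trained to reach any state encountered within a short window ahead, while the high-level policy is trained to emit the subgoal the low-level policy should pursue en route to a more distant goal.

```python
def relabel_demo(states, actions, low_horizon=30, high_horizon=300):
    """Slice one unsegmented demonstration into goal-conditioned examples.

    A sketch under assumed conventions: `states` and `actions` are aligned
    lists from one demonstration; window sizes are hypothetical defaults.

    Returns:
      low_data:  (state, goal, action) triples for the low-level policy,
                 where the goal is a state reached within `low_horizon`
                 steps of the current state.
      high_data: (state, goal, subgoal) triples for the high-level policy,
                 where the goal is a state reached within `high_horizon`
                 steps and the subgoal label is the state `low_horizon`
                 steps ahead (or the goal itself, if it is nearer).
    """
    T = len(states)
    low_data, high_data = [], []
    for t in range(T - 1):
        # Low-level relabeling: every state up to low_horizon steps ahead
        # is treated as a goal that the demonstrated action progresses toward.
        for w in range(1, min(low_horizon, T - 1 - t) + 1):
            low_data.append((states[t], states[t + w], actions[t]))
        # High-level relabeling: distant states become goals; the training
        # label is the subgoal the low-level policy should be handed next.
        for w in range(1, min(high_horizon, T - 1 - t) + 1):
            subgoal = states[t + min(w, low_horizon)]
            high_data.append((states[t], states[t + w], subgoal))
    return low_data, high_data
```

Both relabeled datasets would then feed goal-conditioned behavioral cloning to produce the hierarchical policy, which the second phase fine-tunes with reinforcement learning.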
