Skill-based Model-based Reinforcement Learning

Model-based reinforcement learning (RL) is a sample-efficient way of learning complex behaviors by leveraging a learned single-step dynamics model to plan actions in imagination. However, planning every single action for a long-horizon task is impractical, akin to a human planning out every muscle movement. Instead, humans plan efficiently with high-level skills. From this intuition, we propose a Skill-based Model-based RL framework (SkiMo) that enables planning in the skill space using a skill dynamics model, which directly predicts the outcome of executing an entire skill rather than predicting every intermediate state step by step. For accurate and efficient long-term planning, we jointly learn the skill dynamics model and a skill repertoire from prior experience. We then harness the learned skill dynamics model to accurately simulate and plan over long horizons in the skill space, which enables efficient downstream learning of long-horizon, sparse-reward tasks. Experimental results in navigation and manipulation domains show that SkiMo extends the temporal horizon of model-based approaches and improves the sample efficiency of both model-based RL and skill-based RL. Code and videos are available at https://clvrai.com/skimo
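To make the core idea concrete, below is a minimal sketch of what planning in skill space with a learned skill dynamics model could look like. It is written in PyTorch-style Python under our own assumptions: the class and function names (`SkillDynamics`, `plan_skills`, `reward_fn`), the cross-entropy-method planner, and all hyperparameters are illustrative, not the authors' implementation; SkiMo's actual model architecture, skill representation, and planner are specified in the paper.

```python
# Hypothetical sketch of skill-space planning with a learned skill
# dynamics model, in the spirit of the abstract above. Not the
# authors' code; all names and numbers are assumptions.
import torch
import torch.nn as nn

class SkillDynamics(nn.Module):
    """Predicts the latent state after executing one whole H-step skill,
    skipping the intermediate low-level states entirely."""
    def __init__(self, state_dim, skill_dim, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + skill_dim, hidden), nn.ELU(),
            nn.Linear(hidden, hidden), nn.ELU(),
            nn.Linear(hidden, state_dim),
        )

    def forward(self, z, skill):
        return self.net(torch.cat([z, skill], dim=-1))

@torch.no_grad()
def plan_skills(z0, dynamics, reward_fn, horizon=5, skill_dim=10,
                samples=512, elites=64, iters=6):
    """Cross-entropy-method search over a sequence of skill latents.
    `z0` is the current latent state, shape (state_dim,);
    `reward_fn(z, skill)` scores one imagined skill execution."""
    mean = torch.zeros(horizon, skill_dim)
    std = torch.ones(horizon, skill_dim)
    for _ in range(iters):
        # Sample candidate skill sequences from the current distribution.
        skills = mean + std * torch.randn(samples, horizon, skill_dim)
        z = z0.expand(samples, -1)
        ret = torch.zeros(samples)
        for t in range(horizon):            # one step = one entire skill
            ret += reward_fn(z, skills[:, t])
            z = dynamics(z, skills[:, t])   # jump many env steps at once
        # Refit the sampling distribution to the highest-return plans.
        elite = skills[ret.topk(elites).indices]
        mean, std = elite.mean(0), elite.std(0) + 1e-6
    return mean[0]  # execute the first skill, then replan (MPC-style)
```

Because each planning step corresponds to an entire multi-step skill, a 5-step plan already covers 5H environment steps; this is, schematically, how skill-space planning extends the effective temporal horizon relative to planning one primitive action at a time.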
