TRAIL: Near-Optimal Imitation Learning with Suboptimal Data

The aim in imitation learning is to learn effective policies by utilizing near-optimal expert demonstrations. However, high-quality demonstrations from human experts can be expensive to obtain in large numbers. On the other hand, it is often much easier to obtain large quantities of suboptimal or task-agnostic trajectories, which are not useful for direct imitation but can nevertheless provide insight into the dynamical structure of the environment, showing what could be done in the environment even if not what should be done. We ask: is it possible to utilize such suboptimal offline datasets to facilitate provably improved downstream imitation learning? In this work, we answer this question affirmatively and present training objectives that use offline datasets to learn a factored transition model whose structure enables the extraction of a latent action space. Our theoretical analysis shows that the learned latent action space can boost the sample efficiency of downstream imitation learning, effectively reducing the need for large near-optimal expert datasets through the use of auxiliary non-expert data. To learn the latent action space in practice, we propose TRAIL (Transition-Reparametrized Actions for Imitation Learning), an algorithm that learns an energy-based transition model contrastively and uses the transition model to reparametrize the action space for sample-efficient imitation learning. We evaluate the practicality of our objective through experiments on a set of navigation and locomotion tasks. Our results verify the benefits suggested by our theory and show that TRAIL improves baseline imitation learning by up to 4x in performance.
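
To make the abstract's recipe concrete, below is a minimal sketch (not the authors' reference implementation) of a factored, energy-based transition model T(s' | s, a) ∝ exp(φ(s, a)ᵀψ(s')) trained with an InfoNCE-style contrastive loss on offline transitions, with φ(s, a) used as the reparametrized latent action. The network sizes, latent dimension, and the use of in-batch negatives are assumptions made for illustration.

```python
# Sketch only: a contrastively trained, factored transition model whose
# (s, a)-encoder yields a latent action for downstream imitation learning.
import torch
import torch.nn as nn
import torch.nn.functional as F

class FactoredTransitionModel(nn.Module):
    def __init__(self, state_dim, action_dim, latent_dim=8, hidden=256):
        super().__init__()
        # phi: (s, a) -> latent action z (assumed architecture)
        self.phi = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, latent_dim))
        # psi: s' -> feature used to score candidate next states
        self.psi = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, latent_dim))

    def contrastive_loss(self, s, a, s_next):
        z = self.phi(torch.cat([s, a], dim=-1))   # (B, d)
        w = self.psi(s_next)                      # (B, d)
        logits = z @ w.t()                        # (B, B) pairwise energies
        labels = torch.arange(s.shape[0], device=s.device)
        # Each (s, a) should score its own s' highest; other next states
        # in the batch serve as negatives (in-batch negatives assumption).
        return F.cross_entropy(logits, labels)

    def latent_action(self, s, a):
        # Reparametrized action used in place of the raw action for
        # sample-efficient downstream behavioral cloning.
        return self.phi(torch.cat([s, a], dim=-1))
```

In this sketch, downstream imitation would clone expert behavior in the space of `latent_action` outputs rather than raw actions; a decoder mapping latent actions back to executable environment actions (learned from the same offline data in the full method) is omitted here for brevity.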
