Learning Sparse Rewarded Tasks from Sub-Optimal Demonstrations

Model-free deep reinforcement learning (RL) has demonstrated its superiority on many complex sequential decision-making problems. However, heavy dependence on dense rewards and high sample complexity impede the wide adoption of these methods in real-world scenarios. On the other hand, imitation learning (IL) learns effectively in sparse-reward tasks by leveraging existing expert demonstrations. In practice, however, collecting a sufficient number of expert demonstrations can be prohibitively expensive, and the quality of the demonstrations typically limits the performance of the learned policy. In this work, we propose Self-Adaptive Imitation Learning (SAIL), which achieves (near-)optimal performance on highly challenging sparse-reward tasks given only a limited number of sub-optimal demonstrations. SAIL combines the advantages of IL and RL to substantially reduce sample complexity, effectively exploiting sub-optimal demonstrations while efficiently exploring the environment to surpass the demonstrated performance. Extensive empirical results show that SAIL not only significantly improves sample efficiency but also achieves much better final performance across different continuous control tasks, compared to the state-of-the-art.
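The abstract describes a policy that starts from sub-optimal demonstrations and improves past them through its own exploration. A minimal, self-contained sketch of that core idea is given below: seed a trajectory buffer with the demonstrations and let higher-return trajectories collected by the learning policy displace them, so the imitation target improves over time. This is an illustrative assumption about the mechanism rather than the authors' implementation; the class name `SelfAdaptiveBuffer` and the dummy transitions are hypothetical.

```python
import random


class SelfAdaptiveBuffer:
    """Keeps the best trajectories seen so far.

    Initialized with sub-optimal demonstrations, which are gradually
    displaced by better self-collected trajectories (a sketch of the
    "self-adaptive" idea, not SAIL's exact update rule).
    """

    def __init__(self, capacity):
        self.capacity = capacity
        self.trajectories = []  # list of (episode_return, transitions)

    def add(self, episode_return, transitions):
        self.trajectories.append((episode_return, transitions))
        # Keep only the top-`capacity` trajectories ranked by return.
        self.trajectories.sort(key=lambda item: item[0], reverse=True)
        del self.trajectories[self.capacity:]

    def sample(self, batch_size):
        # Sample transitions uniformly from the stored trajectories;
        # these would serve as imitation targets for the policy update.
        flat = [t for _, traj in self.trajectories for t in traj]
        return random.sample(flat, min(batch_size, len(flat)))


if __name__ == "__main__":
    buffer = SelfAdaptiveBuffer(capacity=3)
    # Seed with sub-optimal demonstrations (dummy transitions for illustration).
    for demo_return in [10.0, 12.0, 11.0]:
        buffer.add(demo_return, [("s", "a", "r", "s_next")] * 5)
    # A self-collected trajectory with a higher return displaces the worst demo,
    # so the policy imitates increasingly better behavior over time.
    buffer.add(20.0, [("s", "a", "r", "s_next")] * 5)
    print([ret for ret, _ in buffer.trajectories])  # [20.0, 12.0, 11.0]
```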
