Learning Sparse Rewarded Tasks from Sub-Optimal Demonstrations

Model-free deep reinforcement learning (RL) has demonstrated its superiority on many complex sequential decision-making problems. However, heavy dependence on dense rewards and high sample complexity impede the wide adoption of these methods in real-world scenarios. On the other hand, imitation learning (IL) learns effectively in sparse-reward tasks by leveraging existing expert demonstrations. In practice, however, collecting a sufficient number of expert demonstrations can be prohibitively expensive, and the quality of the demonstrations typically limits the performance of the learned policy. In this work, we propose Self-Adaptive Imitation Learning (SAIL), which achieves (near-)optimal performance on highly challenging sparse-reward tasks given only a limited number of sub-optimal demonstrations. SAIL combines the advantages of IL and RL to substantially reduce sample complexity, effectively exploiting sub-optimal demonstrations while efficiently exploring the environment to surpass the demonstrated performance. Extensive empirical results show that SAIL not only significantly improves sample efficiency but also achieves much better final performance across different continuous control tasks, compared to the state-of-the-art.
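The abstract describes a policy that starts from sub-optimal demonstrations and improves past them through its own exploration. A minimal, self-contained sketch of that core idea is given below: seed a trajectory buffer with the demonstrations and let higher-return trajectories collected by the learning policy displace them, so the imitation target improves over time. This is an illustrative assumption about the mechanism rather than the authors' implementation; the class name `SelfAdaptiveBuffer` and the dummy transitions are hypothetical.

```python
import random


class SelfAdaptiveBuffer:
    """Keeps the best trajectories seen so far.

    Initialized with sub-optimal demonstrations, which are gradually
    displaced by better self-collected trajectories (a sketch of the
    "self-adaptive" idea, not SAIL's exact update rule).
    """

    def __init__(self, capacity):
        self.capacity = capacity
        self.trajectories = []  # list of (episode_return, transitions)

    def add(self, episode_return, transitions):
        self.trajectories.append((episode_return, transitions))
        # Keep only the top-`capacity` trajectories ranked by return.
        self.trajectories.sort(key=lambda item: item[0], reverse=True)
        del self.trajectories[self.capacity:]

    def sample(self, batch_size):
        # Sample transitions uniformly from the stored trajectories;
        # these would serve as imitation targets for the policy update.
        flat = [t for _, traj in self.trajectories for t in traj]
        return random.sample(flat, min(batch_size, len(flat)))


if __name__ == "__main__":
    buffer = SelfAdaptiveBuffer(capacity=3)
    # Seed with sub-optimal demonstrations (dummy transitions for illustration).
    for demo_return in [10.0, 12.0, 11.0]:
        buffer.add(demo_return, [("s", "a", "r", "s_next")] * 5)
    # A self-collected trajectory with a higher return displaces the worst demo,
    # so the policy imitates increasingly better behavior over time.
    buffer.add(20.0, [("s", "a", "r", "s_next")] * 5)
    print([ret for ret, _ in buffer.trajectories])  # [20.0, 12.0, 11.0]
```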
