Episodic Self-Imitation Learning with Hindsight

Episodic self-imitation learning, a novel self-imitation algorithm with a trajectory selection module and an adaptive loss function, is proposed to speed up reinforcement learning. Whereas the original self-imitation learning algorithm samples good state–action pairs from the experience replay buffer, our agent leverages entire episodes with hindsight to aid self-imitation learning. A selection module is introduced to filter uninformative samples from each episode during the update. The proposed method overcomes the limitations of the standard self-imitation learning algorithm, a transition-based method that performs poorly in continuous control environments with sparse rewards. In our experiments, episodic self-imitation learning outperforms baseline on-policy algorithms and achieves performance comparable to state-of-the-art off-policy algorithms on several simulated robot control tasks. The trajectory selection module is shown to prevent the agent from learning undesirable hindsight experiences. Because it can solve sparse-reward problems in continuous control settings, episodic self-imitation learning has the potential to be applied to real-world problems with continuous action spaces, such as robot guidance and manipulation.
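
To make the episodic update concrete, the sketch below illustrates the core steps in Python. It is a minimal illustration under stated assumptions, not the paper's implementation: the episode layout, the distance-based sparse reward, and the names `hindsight_relabel`, `select_informative`, and `value_fn` are all illustrative, and the selection rule (keep only state–action pairs whose discounted hindsight return exceeds the critic's estimate) is one plausible reading of the trajectory selection module.

```python
import numpy as np

def hindsight_relabel(episode, tol=0.05):
    """Relabel an episode with the goal it actually achieved, so even a
    failed trajectory becomes a valid demonstration for that goal.
    `episode` maps "obs", "actions", "achieved" to arrays (assumed layout)."""
    goal = episode["achieved"][-1]  # final achieved state becomes the goal
    dists = np.linalg.norm(episode["achieved"] - goal, axis=-1)
    rewards = np.where(dists < tol, 0.0, -1.0)  # HER-style sparse reward
    return {**episode, "goal": goal, "rewards": rewards}

def select_informative(episode, value_fn, gamma=0.98):
    """Trajectory selection: keep only state-action pairs whose discounted
    hindsight return exceeds the critic's value estimate, filtering out
    uninformative samples before the self-imitation update."""
    g, returns = 0.0, []
    for r in episode["rewards"][::-1]:  # discounted return-to-go
        g = r + gamma * g
        returns.append(g)
    returns = np.asarray(returns[::-1])
    adv = returns - value_fn(episode["obs"], episode["goal"])
    keep = adv > 0.0
    return episode["obs"][keep], episode["actions"][keep], adv[keep]

def self_imitation_loss(log_probs, adv):
    """Advantage-weighted imitation loss over the selected pairs: larger
    positive advantages get more imitation weight (a simple stand-in for
    the adaptive loss, not the paper's exact weighting)."""
    return -(np.clip(adv, 0.0, None) * log_probs).mean()
```

In a full training loop, one would roll out episodes with the current goal-conditioned policy, relabel and filter them as above, and combine the resulting imitation term with a standard on-policy objective such as PPO.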
