Efficient hindsight reinforcement learning using demonstrations for robotic tasks with sparse rewards

The goal of reinforcement learning is to enable an agent to learn by using rewards. However, some robotic tasks are naturally specified with sparse rewards, and manually shaping reward functions is a difficult undertaking. In this article, we propose a general, model-free reinforcement learning approach for learning robotic tasks with sparse rewards. First, a variant of Hindsight Experience Replay, Curious and Aggressive Hindsight Experience Replay, is proposed to improve the sample efficiency of reinforcement learning methods and avoid the need for complicated reward engineering. Second, building on the Twin Delayed Deep Deterministic policy gradient (TD3) algorithm, demonstrations are leveraged to overcome the exploration problem and speed up policy training. Finally, an action loss is added to the loss function to minimize oscillation of the output actions while maximizing their value. Experiments on simulated robotic tasks are performed with different hyperparameters to verify the effectiveness of our method. The results show that our method effectively solves the sparse reward problem and achieves a high learning speed.
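The abstract describes an actor objective that combines the standard TD3 value term with a demonstration term and an action loss. The sketch below is one plausible reading of that combination, not the paper's actual implementation: it assumes a PyTorch actor network, a critic that exposes its first Q-head as `critic.q1`, and illustrative weighting coefficients `lambda_bc` and `lambda_action` that are hypothetical placeholders.

```python
import torch
import torch.nn.functional as F

def actor_loss(actor, critic, batch, demo_batch,
               lambda_bc=1.0, lambda_action=0.1):
    """Hypothetical sketch of a TD3-style actor loss augmented with a
    behavior-cloning term on demonstrations and an action penalty.
    Interfaces and coefficients are assumptions, not the paper's."""
    # Standard TD3 actor objective: maximize Q(s, pi(s)) on replayed states.
    pi_actions = actor(batch["obs"])
    q_term = -critic.q1(batch["obs"], pi_actions).mean()

    # Demonstration (behavior-cloning) term: pull the policy toward
    # demonstrated actions to ease exploration under sparse rewards.
    bc_term = F.mse_loss(actor(demo_batch["obs"]), demo_batch["actions"])

    # Action penalty: discourage large, oscillating outputs while the Q
    # term still drives the policy toward high-value actions.
    action_term = pi_actions.pow(2).mean()

    return q_term + lambda_bc * bc_term + lambda_action * action_term
```

In this reading, the relative weights of the demonstration and action terms would be tuned per task; the abstract does not specify their values, so the defaults above are only for illustration.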
