MHER: Model-based Hindsight Experience Replay

Solving multi-goal reinforcement learning (RL) problems with sparse rewards is generally challenging. Existing approaches have utilized goal relabeling on collected experiences to alleviate the issues arising from sparse rewards. However, these methods are still limited in efficiency and cannot make full use of experiences. In this paper, we propose Model-based Hindsight Experience Replay (MHER), which exploits experiences more efficiently by leveraging environmental dynamics to generate virtual achieved goals. Replacing original goals with virtual goals generated from interaction with a trained dynamics model leads to a novel relabeling method, model-based relabeling (MBR). Based on MBR, MHER performs both reinforcement learning and supervised learning for efficient policy improvement. Theoretically, we also prove that the supervised part of MHER, i.e., goal-conditioned supervised learning with MBR data, optimizes a lower bound on the multi-goal RL objective. Experimental results in several point-based tasks and simulated robotics environments show that MHER achieves significantly higher sample efficiency than previous state-of-the-art methods.
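To make the relabeling idea concrete, below is a minimal Python sketch of model-based relabeling under several assumptions not spelled out in the abstract: a learned dynamics model `dynamics_model(state, action)`, a goal-conditioned policy `policy(state, goal)`, a mapping `achieved_goal_fn(state)` from states to achieved goals, and a fixed imagined-rollout length are all hypothetical names and design choices used only for illustration, not the authors' exact procedure.

```python
import numpy as np

def model_based_relabel(transitions, dynamics_model, policy, achieved_goal_fn,
                        rollout_steps=5):
    """Hypothetical sketch of model-based relabeling (MBR).

    For each stored transition (state, action, goal, next_state), roll the
    learned dynamics model forward from next_state under the current policy,
    then replace the original goal with the virtual goal achieved at the end
    of the imagined rollout.
    """
    relabeled = []
    for (state, action, goal, next_state) in transitions:
        virtual_state = next_state
        for _ in range(rollout_steps):
            # Imagine future behavior: query the policy and step the learned model.
            virtual_action = policy(virtual_state, goal)
            virtual_state = dynamics_model(virtual_state, virtual_action)
        virtual_goal = achieved_goal_fn(virtual_state)
        # Recompute a sparse reward for the relabeled transition (assumed 0/-1 scheme).
        achieved_now = achieved_goal_fn(next_state)
        reward = 0.0 if np.allclose(achieved_now, virtual_goal) else -1.0
        relabeled.append((state, action, virtual_goal, next_state, reward))
    return relabeled
```

In a full training loop, the relabeled transitions would feed both an off-policy RL update and a goal-conditioned supervised (behavior-cloning style) loss; per the abstract, it is this supervised component on MBR data whose objective the paper shows to lower-bound the multi-goal RL objective.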
