Remember and Forget for Experience Replay

Experience replay (ER) is a fundamental component of off-policy deep reinforcement learning (RL). ER recalls experiences from past iterations to compute gradient estimates for the current policy, increasing data efficiency. However, the accuracy of such updates may deteriorate when the policy diverges from past behaviors, which can undermine the performance of ER. Many algorithms mitigate this issue by tuning hyper-parameters to slow down policy changes. An alternative is to actively enforce the similarity between the policy and the experiences in the replay memory. We introduce Remember and Forget Experience Replay (ReF-ER), a novel method that can enhance RL algorithms with parameterized policies. ReF-ER (1) skips gradients computed from experiences that are too unlikely under the current policy and (2) regulates policy changes within a trust region of the replayed behaviors. We couple ReF-ER with Q-learning, deterministic policy gradient and off-policy gradient methods. We find that ReF-ER consistently improves the performance of continuous-action, off-policy RL on fully observable benchmarks and partially observable flow control problems.
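
As a rough illustration of the two rules stated in the abstract, the sketch below (a) masks out gradient contributions from replayed samples whose importance weight rho = pi(a|s) / mu(a|s) falls outside a trust interval [1/c_max, c_max], and (b) adds a penalty that keeps the policy close to the replayed behaviors. The function name refer_loss, the threshold c_max, the penalty coefficient beta, and the simple weighting are illustrative assumptions for this sketch, not the authors' reference implementation; in the full method such quantities would be tuned or annealed during training.

```python
import numpy as np

def refer_loss(rho, per_sample_loss, kl_to_behavior, c_max=4.0, beta=0.3):
    """Minimal sketch of the two ReF-ER rules (hypothetical helper, not the paper's code).

    rho             : importance weights pi(a|s) / mu(a|s) for each replayed sample
    per_sample_loss : off-policy loss (e.g. policy-gradient surrogate) per sample
    kl_to_behavior  : per-sample estimate of the divergence from the behavior policy
    """
    rho = np.asarray(rho)
    # Rule 1: only "near-policy" samples, with 1/c_max < rho < c_max,
    # contribute to the gradient estimate; far-policy samples are skipped.
    near_policy = (rho > 1.0 / c_max) & (rho < c_max)
    policy_term = np.where(near_policy, per_sample_loss, 0.0)
    # Rule 2: penalize divergence from the replayed behaviors,
    # constraining updates to a trust region around them.
    return np.mean(policy_term + beta * np.asarray(kl_to_behavior))
```

For example, with rho = [0.1, 1.0, 8.0] and c_max = 4, only the middle sample contributes to the policy term, while all three contribute to the behavior-divergence penalty.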
