Learning to Sample with Local and Global Contexts in Experience Replay Buffer

Experience replay, which enables agents to remember and reuse past experience, plays a significant role in the success of off-policy reinforcement learning (RL). To use the replay buffer efficiently, transitions should be sampled according to their significance; prioritized experience replay (PER), for example, samples more important transitions with higher probability. However, conventional PER can produce highly biased samples because it relies on a single metric, the TD-error, and computes the sampling rate of each transition independently of the others. To tackle this issue, we propose a Neural Experience Replay Sampler (NERS), which adaptively evaluates the relative importance of a sampled transition by drawing context not only from its own (local) features, such as the TD-error or raw features, but also from the other (global) transitions in the sampled batch. We validate our framework on multiple benchmark tasks for both continuous and discrete control and show that it significantly improves the performance of various off-policy RL methods. Further analysis confirms that the improvements indeed come from the use of diverse features and from considering the relative importance of experiences.
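
To make the contrast in the abstract concrete, the following is a minimal, hypothetical sketch (not the authors' implementation): `per_probabilities` reproduces the standard PER rule of assigning each transition a probability from its own TD-error alone, while `ners_style_scores` stands in for a learned sampler by scoring each transition from its local features concatenated with batch-level (global) summary statistics. The linear scorer, feature choices, and function names are illustrative assumptions only.

```python
import numpy as np

# Hypothetical sketch, not the paper's code: contrasts independent, TD-error-only
# priorities (PER) with a scorer that also sees global, batch-level context (NERS-style).

def per_probabilities(td_errors, alpha=0.6, eps=1e-6):
    """PER: p_i proportional to |TD-error_i|^alpha, computed independently per transition."""
    priorities = (np.abs(td_errors) + eps) ** alpha
    return priorities / priorities.sum()

def ners_style_scores(local_features, weights):
    """Toy stand-in for a learned sampler: each transition's local features are
    concatenated with a batch-level (global) summary before scoring, so the
    resulting importance is relative to the other sampled transitions."""
    n = len(local_features)
    global_context = local_features.mean(axis=0, keepdims=True)        # (1, d) batch summary
    joint = np.concatenate(
        [local_features, np.repeat(global_context, n, axis=0)], axis=1  # (n, 2d) local + global
    )
    logits = joint @ weights                                             # linear scorer for illustration
    exp = np.exp(logits - logits.max())
    return exp / exp.sum()                                               # relative importance over the batch

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    td = rng.normal(size=8)
    feats = np.column_stack([np.abs(td), rng.normal(size=8)])            # e.g., |TD-error| plus a raw feature
    print("PER probs :", per_probabilities(td).round(3))
    print("NERS-like :", ners_style_scores(feats, rng.normal(size=4)).round(3))
```

In this toy version the global context is just the batch mean of the local features; the point is only that a transition's sampling weight depends on the rest of the batch, unlike PER's per-transition priorities.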
