Look Back When Surprised: Stabilizing Reverse Experience Replay for Neural Approximation

Experience replay methods, an essential component of reinforcement learning (RL) algorithms, are designed to mitigate spurious correlations and biases while learning from temporally dependent data. Roughly speaking, they allow mini-batches to be drawn from a large buffer so that temporal correlations do not hinder the performance of gradient-descent algorithms. In this experimental work, we consider the recently developed and theoretically rigorous reverse experience replay (RER), which has been shown to remove such spurious biases in simplified theoretical settings. We combine RER with optimistic experience replay (OER) to obtain RER++, which is stable under neural function approximation. We show experimentally that RER++ outperforms techniques such as prioritized experience replay (PER) on various tasks, at significantly lower computational cost. It is well known in the RL literature that greedily choosing the samples with the largest TD error (as in OER), or forming mini-batches from consecutive data points (as in RER), leads to poor performance on its own; yet our method, which combines these two techniques, works very well.

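To make the combination concrete, the following is a minimal sketch (Python/NumPy) of a buffer that first selects pivot transitions greedily by absolute TD error (the OER component) and then replays the consecutive transitions preceding each pivot in reverse temporal order (the RER component). The class name, method signatures, fixed sequence length, and FIFO eviction are illustrative assumptions for exposition, not the exact RER++ implementation described in the paper.

```python
import numpy as np


class RERppBufferSketch:
    """Illustrative buffer combining OER-style pivot selection (largest
    absolute TD error) with RER-style reverse, consecutive mini-batches.
    Names and details are assumptions, not the paper's exact algorithm."""

    def __init__(self, capacity, seq_len):
        self.capacity = capacity
        self.seq_len = seq_len      # length of each reversed sequence
        self.storage = []           # transitions (s, a, r, s_next, done)
        self.td_errors = []         # |TD error| recorded per transition

    def add(self, transition, td_error=1.0):
        if len(self.storage) >= self.capacity:
            self.storage.pop(0)     # FIFO eviction (old indices go stale)
            self.td_errors.pop(0)
        self.storage.append(transition)
        self.td_errors.append(abs(td_error))

    def sample(self, num_pivots):
        """OER step: pick the num_pivots transitions with the largest TD error.
        RER step: for each pivot t, return the consecutive transitions
        t, t-1, ..., t-seq_len+1, i.e. replayed in reverse temporal order."""
        errors = np.asarray(self.td_errors)
        # consider only pivots with a full sequence of predecessors
        valid = np.arange(self.seq_len - 1, len(self.storage))
        pivots = valid[np.argsort(errors[valid])[-num_pivots:]]
        batches = [
            [self.storage[i] for i in range(t, t - self.seq_len, -1)]
            for t in pivots
        ]
        return pivots, batches

    def update_td_errors(self, indices, new_errors):
        # refresh priorities of the sampled pivots after a learner update
        for i, e in zip(indices, new_errors):
            self.td_errors[i] = abs(e)
```

A learner would compute fresh TD errors on each reversed sequence and call `update_td_errors` on the pivots, mirroring how PER refreshes priorities, but with the backward, consecutive ordering that RER prescribes.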