The importance of experience replay database composition in deep reinforcement learning

Recent years have seen growing interest in the use of deep neural networks as function approximators in reinforcement learning. This paper investigates the potential of the Deep Deterministic Policy Gradient method for a robot control problem, both in simulation and on a real setup. The influence of the size and composition of the experience replay database is investigated, and requirements on the distribution of the stored experiences over the state-action space are identified. Of particular interest is the importance of negative experiences that are not close to an optimal policy. It is shown how training on samples that are insufficiently spread over the state-action space can cause the method to fail, and how maintaining the spread of the samples in the experience database over the state-action space can greatly benefit learning.
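
To make the last point concrete, the sketch below shows one possible way a replay buffer could preserve the spread of its contents over the state-action space, rather than overwriting samples in the usual first-in-first-out manner. This is an illustrative assumption, not the authors' implementation: the class name SpreadReplayBuffer and the nearest-neighbour overwrite rule are hypothetical choices made only to demonstrate the idea.

```python
# Minimal sketch (assumption, not the paper's method) of a replay buffer that
# tries to keep its samples spread over the state-action space. When full, it
# overwrites the most redundant sample: the one whose nearest neighbour in
# (state, action) space is closest, instead of the oldest sample (FIFO).
import numpy as np


class SpreadReplayBuffer:
    def __init__(self, capacity, state_dim, action_dim):
        self.capacity = capacity
        # Concatenated (state, action) vectors used for distance computations.
        self.keys = np.zeros((capacity, state_dim + action_dim))
        # Full (state, action, reward, next_state) transitions.
        self.data = [None] * capacity
        self.size = 0

    def add(self, state, action, reward, next_state):
        key = np.concatenate([np.asarray(state, float).ravel(),
                              np.asarray(action, float).ravel()])
        if self.size < self.capacity:
            idx = self.size
            self.size += 1
        else:
            # Pairwise distances between all stored (state, action) keys.
            dists = np.linalg.norm(self.keys[:, None, :] - self.keys[None, :, :],
                                   axis=-1)
            np.fill_diagonal(dists, np.inf)
            # Overwrite the sample with the smallest nearest-neighbour
            # distance, i.e. the one contributing least to the spread.
            idx = int(np.argmin(dists.min(axis=1)))
        self.keys[idx] = key
        self.data[idx] = (state, action, reward, next_state)

    def sample(self, batch_size, rng=np.random):
        # Uniform sampling over the stored transitions.
        idx = rng.randint(0, self.size, size=batch_size)
        return [self.data[i] for i in idx]
```

The pairwise-distance computation in this sketch costs O(N^2) per overwrite; in practice one would maintain nearest-neighbour distances incrementally or use a spatial index (e.g. a k-d tree), but the simple form keeps the coverage-preserving idea visible.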
