Multi-critic DDPG Method and Double Experience Replay

The Deep Deterministic Policy Gradient (DDPG) reinforcement learning method typically consists of actor learning and critic learning. Because actor learning relies heavily on critic learning, the performance of DDPG is highly sensitive to the quality of the critic, which leads to stability issues. To improve the stability and performance of DDPG, the multi-critic DDPG method (MCDDPG) is proposed for more reliable critic learning. The average value of multiple critics replaces the single critic of DDPG, providing better resistance when any one critic performs poorly, and the multiple independent critics can learn knowledge from the environment more broadly. In addition, an extension of the experience replay mechanism, double experience replay, is introduced to accelerate the training process. All methods are tested on simulated environments from the OpenAI Gym platform, and convincing experimental results are obtained in support of the proposed methods.
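
The critic-averaging idea described above can be illustrated with a minimal sketch in PyTorch. This is not the authors' implementation; the class names (Critic, MultiCritic), the number of critics, and the network sizes are assumptions made purely for illustration.

```python
import torch
import torch.nn as nn

class Critic(nn.Module):
    """A standard DDPG-style critic: Q(s, a) -> scalar."""
    def __init__(self, state_dim, action_dim, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, state, action):
        return self.net(torch.cat([state, action], dim=-1))

class MultiCritic(nn.Module):
    """Holds several independently parameterized critics and averages
    their Q-value estimates, as in the multi-critic idea sketched above."""
    def __init__(self, state_dim, action_dim, num_critics=3):
        super().__init__()
        self.critics = nn.ModuleList(
            Critic(state_dim, action_dim) for _ in range(num_critics)
        )

    def forward(self, state, action):
        # Stack the individual Q estimates and return their mean, so that a
        # single poorly trained critic has less influence on the actor update.
        qs = torch.stack([c(state, action) for c in self.critics], dim=0)
        return qs.mean(dim=0)
```

In a DDPG-style training loop, each critic would be regressed toward its own temporal-difference target, while the actor's policy gradient would use the averaged estimate, e.g. actor_loss = -multi_critic(state, actor(state)).mean(); the details of target computation and the double experience replay extension are not specified here.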
