Distributed Distributional Deterministic Policy Gradients

This work adopts the highly successful distributional perspective on reinforcement learning and adapts it to the continuous control setting. We combine this with a distributed framework for off-policy learning to develop the Distributed Distributional Deep Deterministic Policy Gradient algorithm (D4PG). We further combine this technique with a number of additional, simple improvements such as the use of $N$-step returns and prioritized experience replay. Experimentally we examine the contribution of each of these individual components, show how they interact, and evaluate their combined effect. Our results show that across a wide variety of simple control tasks, difficult manipulation tasks, and a set of hard obstacle-based locomotion tasks, the D4PG algorithm achieves state-of-the-art performance.
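To make the distributional critic update concrete, below is a minimal NumPy sketch of the categorical projection step such an algorithm uses to form the critic's target: each atom $z_i$ of a fixed support is backed up through the $N$-step Bellman operator, $\mathcal{T}z_i = r + \gamma^N z_i$, and its probability mass is redistributed linearly onto the two nearest support atoms (a C51-style categorical critic). The function name, support range $[v_{\min}, v_{\max}]$, and array shapes are illustrative assumptions for this sketch, not the paper's implementation.

```python
import numpy as np

def categorical_projection(rewards, discounts, next_probs,
                           v_min=-150.0, v_max=150.0):
    """Project a Bellman-backed-up categorical distribution onto a fixed support.

    rewards:    (batch,) accumulated N-step return, sum_{k<N} gamma^k r_{t+k}
    discounts:  (batch,) gamma^N, set to 0.0 where the episode ended within N steps
    next_probs: (batch, num_atoms) critic probabilities at the bootstrap state
    returns:    (batch, num_atoms) target probabilities on the same support
    """
    num_atoms = next_probs.shape[1]
    support = np.linspace(v_min, v_max, num_atoms)   # atom locations z_i
    delta_z = (v_max - v_min) / (num_atoms - 1)

    # N-step Bellman update of every atom, clipped to the support:
    # Tz_i = r + gamma^N * z_i.
    tz = np.clip(rewards[:, None] + discounts[:, None] * support[None, :],
                 v_min, v_max)

    # Fractional index of each updated atom on the original support; its mass
    # is split linearly between the two neighbouring atoms.
    b = (tz - v_min) / delta_z
    lower = np.floor(b).astype(np.int64)
    upper = np.ceil(b).astype(np.int64)
    # If Tz_i lands exactly on an atom, lower == upper and both linear weights
    # below would be zero; nudge the pair apart so the mass is preserved.
    exact = lower == upper
    lower[exact & (upper > 0)] -= 1
    upper[(lower == upper) & (lower < num_atoms - 1)] += 1

    target = np.zeros_like(next_probs)
    rows = np.arange(next_probs.shape[0])[:, None]
    np.add.at(target, (rows, lower), next_probs * (upper - b))
    np.add.at(target, (rows, upper), next_probs * (b - lower))
    return target
```

With a target of this form, the critic would be trained by minimizing the cross-entropy between its predicted distribution at $(s_t, a_t)$ and the projected target, while the actor follows the deterministic policy gradient taken through the expectation of the critic's distribution.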
