Paper information - Actor-critic versus direct policy search: a comparison based on sample complexity

Sample efficiency is a critical property when optimizing policy parameters for the controller of a robot. In this paper, we evaluate two state-of-the-art policy optimization algorithms. One is Deep Deterministic Policy Gradient (DDPG), a recent deep reinforcement learning method based on an actor-critic algorithm, which has been shown to perform well on various control benchmarks. The other is Covariance Matrix Adaptation Evolution Strategy (CMA-ES), a direct policy search method based on black-box optimization that is widely used for robot learning. The algorithms are evaluated on a continuous version of the mountain car benchmark problem in order to compare their sample complexity. From a preliminary analysis, we expect DDPG to be more sample efficient than CMA-ES, and our experimental results confirm this.

Arnaud de Froissard de Broissia | Olivier Sigaud
