Paper information - Policy search via the signed derivative

Policy search via the signed derivative

We consider policy search for reinforcement learning: learning policy parameters, for some fixed policy class, that optimize the performance of a system. In this paper, we propose a novel policy gradient method based on an approximation we call the Signed Derivative. The approximation rests on the intuition that it is often very easy to guess the direction in which control inputs affect future state variables, even if we do not have an accurate model of the system. The resulting algorithm is very simple and requires no model of the environment, and we show that it can outperform standard stochastic estimators of the gradient; indeed, we show that the Signed Derivative algorithm can perform as well as the true (model-based) policy gradient, but without knowledge of the model. We evaluate the algorithm's performance on a simulated task and on two real-world tasks (driving an RC car along a specified trajectory, and jumping onto obstacles with a quadruped robot), and in all cases achieve good performance after very little training.
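The intuition in the abstract can be illustrated with a toy example. The sketch below is an illustration only, not the algorithm as specified in the paper: the scalar system, the linear policy `u = theta * s`, the quadratic cost, the step size, and all function names are assumptions made here. The learner only observes rollouts and assumes it knows that increasing the control increases the next state; it replaces the unknown derivative of the next state with respect to the control by that sign. As a further simplification, only the one-step effect of each control is used.

```python
import numpy as np

def rollout(theta, a=0.9, b=0.5, s0=1.0, horizon=20):
    """Simulate s[t+1] = a*s[t] + b*u[t] with u[t] = theta*s[t].

    a and b play the role of the unknown system; the learner below
    never reads them, it only sees the returned state trajectory.
    """
    states = np.empty(horizon + 1)
    states[0] = s0
    for t in range(horizon):
        u = theta * states[t]
        states[t + 1] = a * states[t] + b * u
    return states

def cost(states):
    """Quadratic cost for driving the state to zero (assumed here)."""
    return float(np.sum(states[1:] ** 2))

def signed_derivative_gradient(states, sign=1.0):
    """Approximate d(cost)/d(theta) from one observed rollout.

    The exact one-step term is 2*s[t+1] * (ds[t+1]/du[t]) * (du[t]/dtheta),
    where ds[t+1]/du[t] = b is unknown. It is replaced by its assumed sign;
    du[t]/dtheta = s[t] follows from the policy itself.
    """
    s_now, s_next = states[:-1], states[1:]
    return float(np.sum(2.0 * s_next * sign * s_now))

theta = 0.0
initial_cost = cost(rollout(theta))
for _ in range(200):
    observed = rollout(theta)
    theta -= 0.05 * signed_derivative_gradient(observed)  # assumed step size
final_cost = cost(rollout(theta))

print(f"theta = {theta:.3f}, cost {initial_cost:.3f} -> {final_cost:.6f}")
```

In this toy setting the update moves `theta` toward the value that cancels the open-loop dynamics, even though the magnitude of the control's effect is never supplied to the learner; only its sign is.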

Andrew Y. Ng | J. Zico Kolter
