Adaptive Step-Size for Policy Gradient Methods

In the last decade, policy gradient methods have grown significantly in popularity within the reinforcement-learning field. In particular, they have been widely employed in motor control and robotic applications, thanks to their ability to cope with continuous state and action domains and partially observable problems. Policy gradient research has mainly focused on identifying effective gradient directions and proposing efficient estimation algorithms. Nonetheless, the performance of policy gradient methods is determined not only by the gradient direction: convergence properties are strongly influenced by the choice of the step size, since small values imply a slow convergence rate, while large values may lead to oscillations or even divergence of the policy parameters. The step-size value is usually chosen by hand tuning, and little attention has been paid to its automatic selection. In this paper, we propose to determine the learning rate by maximizing a lower bound on the expected performance gain. Focusing on Gaussian policies, we derive a lower bound that is a second-order polynomial in the step size, and we show how a simplified version of this lower bound can be maximized when the gradient is estimated from trajectory samples. The properties of the proposed approach are empirically evaluated on a linear-quadratic regulator problem.
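To make the step-size selection concrete, the sketch below illustrates the general idea of maximizing a quadratic lower bound on the performance improvement, B(alpha) = alpha * ||g||^2 - c * alpha^2, whose maximizer is alpha* = ||g||^2 / (2c). This is a simplified illustration, not the paper's exact bound: the constant `c` stands in for the penalty term that, in the paper, depends on the Gaussian policy and on properties of the MDP, and the gradient estimate here is a random placeholder.

```python
import numpy as np

def adaptive_step_size(grad_estimate: np.ndarray, c: float) -> float:
    """Step size maximizing the quadratic bound B(alpha) = alpha*||g||^2 - c*alpha^2.

    `c` is a hypothetical placeholder for the bound's curvature term,
    which in the paper is derived from the policy and MDP constants.
    """
    grad_norm_sq = float(np.dot(grad_estimate, grad_estimate))
    return grad_norm_sq / (2.0 * c)

# Illustrative use inside a generic policy-gradient loop.
theta = np.zeros(2)                      # policy parameters
for _ in range(100):
    g = 0.1 * np.random.randn(2)         # stand-in for a sampled gradient estimate
    alpha = adaptive_step_size(g, c=10.0)
    theta = theta + alpha * g            # gradient ascent with the adaptive step size
```

Under this quadratic model, the update is guaranteed a non-negative bound on the improvement at alpha*, which is the property the paper exploits to avoid hand tuning.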
