Policy Gradient in Continuous Time

Policy search is a method for approximately solving an optimal control problem by performing a parametric optimization search in a given class of parameterized policies. To apply a local optimization technique, such as a gradient method, we wish to evaluate the sensitivity of the performance measure with respect to the policy parameters, the so-called policy gradient. This paper is concerned with the estimation of the policy gradient for continuous-time, deterministic state dynamics, in a reinforcement learning framework, that is, when the decision maker does not have a model of the state dynamics. We show that the usual likelihood ratio methods used in discrete time fail to estimate the gradient because they are subject to variance explosion when the discretization time-step decreases to 0. We describe an alternative approach based on the approximation of the pathwise derivative, which leads to a policy gradient estimate that converges almost surely to the true gradient when the time-step tends to 0. The underlying idea starts with the derivation of an explicit representation of the policy gradient using pathwise derivation. This derivation makes use of the knowledge of the state dynamics. Then, in order to estimate the gradient from the observable data only, we use a stochastic policy to discretize the continuous deterministic system into a stochastic discrete process, which makes it possible to replace the unknown coefficients by quantities that depend solely on known data. We prove the almost sure convergence of this estimate to the true policy gradient when the discretization time-step goes to zero. The method is illustrated on two target problems, in discrete and continuous control spaces.
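As a concrete illustration of the contrast drawn above, the sketch below writes the two gradient representations in notation assumed here purely for exposition (dynamics f, terminal reward r, parameterized policy pi_theta, horizon T; these symbols are ours, not necessarily the paper's). It indicates why a score-function estimator accumulates variance as the time-step shrinks, and why the pathwise form needs derivatives of the dynamics, which is what the stochastic-policy discretization is meant to avoid.

% Minimal sketch, with assumed notation: deterministic dynamics
%   \dot{x}_t = f(x_t, u_t),  control  u_t = \pi_\theta(x_t),
% and a terminal-reward objective  J(\theta) = r(x_T)  on a fixed horizon [0, T].

% 1) Likelihood-ratio (score-function) estimate on a grid of step \delta t,
%    with Gaussian exploration  u_k \sim \mathcal{N}(\pi_\theta(x_k), \sigma^2 I):
\nabla_\theta J(\theta) \;\approx\;
  \mathbb{E}\!\left[ r(x_T) \sum_{k=0}^{T/\delta t - 1}
    \nabla_\theta \log p_\theta(u_k \mid x_k) \right].
% The sum contains T/\delta t score terms, each of non-vanishing variance,
% so, heuristically, the estimator's variance grows without bound as \delta t \to 0.

% 2) Pathwise representation, writing F_\theta(x) := f(x, \pi_\theta(x)):
\nabla_\theta J(\theta) \;=\; \nabla_x r(x_T)\, z_T,
\qquad
\dot{z}_t \;=\; \nabla_x F_\theta(x_t)\, z_t + \nabla_\theta F_\theta(x_t),
\qquad z_0 = 0,
% where z_t = \nabla_\theta x_t. Evaluating \nabla_x F_\theta and
% \nabla_\theta F_\theta requires knowledge of f, i.e., a model of the
% state dynamics; the stochastic-policy discretization described in the
% abstract replaces these unknown coefficients by quantities computable
% from observed data only.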
