A basic formula for online policy gradient algorithms

This note presents a new basic formula for sample-path-based estimation of performance gradients in Markov systems (called policy gradients in the reinforcement learning literature). With this basic formula, many policy-gradient algorithms, including those that have previously appeared in the literature, can be easily developed. The formula follows naturally from a sensitivity equation in perturbation analysis. A new research direction is also discussed.
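The abstract does not reproduce the formula itself, but the sensitivity equation from the perturbation-analysis literature it builds on is commonly written as dη/dθ = π (dP/dθ) g, where π is the stationary distribution of the transition matrix P(θ) and g is the vector of performance potentials. The sketch below illustrates this equation numerically and checks it against a finite-difference estimate; the 3-state chain and its parameterization are hypothetical, chosen only for illustration, and this is not the note's own online estimator:

```python
import numpy as np

# Hypothetical 3-state Markov chain whose transition matrix depends on theta.
def P(theta):
    return np.array([
        [0.5 - theta, 0.3 + theta, 0.2],
        [0.2,         0.5,         0.3],
        [0.3,         0.3,         0.4],
    ])

f = np.array([1.0, 2.0, 3.0])  # performance (reward) function

def stationary(Pm):
    # Solve pi P = pi with sum(pi) = 1.
    n = Pm.shape[0]
    A = np.vstack([Pm.T - np.eye(n), np.ones(n)])
    b = np.zeros(n + 1)
    b[-1] = 1.0
    return np.linalg.lstsq(A, b, rcond=None)[0]

def potentials(Pm, pi):
    # Potentials g solve (I - P + e pi^T) g = f (one common normalization).
    n = Pm.shape[0]
    return np.linalg.solve(np.eye(n) - Pm + np.outer(np.ones(n), pi), f)

theta = 0.05
Pm = P(theta)
pi = stationary(Pm)
g = potentials(Pm, pi)

# dP/dtheta via central difference (here P is linear in theta, so exact).
dP = (P(theta + 1e-6) - P(theta - 1e-6)) / 2e-6

# Sensitivity equation: d eta / d theta = pi (dP/dtheta) g.
grad = pi @ dP @ g

# Finite-difference check on eta(theta) = pi(theta) . f.
eps = 1e-5
eta = lambda t: stationary(P(t)) @ f
fd = (eta(theta + eps) - eta(theta - eps)) / (2 * eps)
print(abs(grad - fd) < 1e-6)
```

The finite-difference check confirms that π (dP/dθ) g recovers the derivative of the steady-state performance η = π·f; sample-path algorithms of the kind the note describes replace π and g with quantities estimated along a single trajectory.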
