A basic formula for performance gradient estimation of semi-Markov decision processes

This paper presents a basic formula for performance gradient estimation of semi-Markov decision processes (SMDPs) under the average-reward criterion. The formula follows directly from a sensitivity equation in perturbation analysis. Building on it, we develop three sample-path-based gradient estimation algorithms that use a single sample path. These algorithms naturally extend many gradient estimation algorithms for discrete-time Markov systems to continuous-time semi-Markov models; in particular, they require less storage than the existing algorithm in the literature.
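For context, the discrete-time analogue of such a sensitivity formula is well known: for an ergodic Markov chain with transition matrix P(θ), stationary distribution π, reward vector f, and performance potentials g, the average reward η = πf satisfies dη/dθ = π (dP(θ)/dθ) g. The sketch below illustrates how a single-sample-path gradient estimator of this general flavor can be organized for an SMDP under the average-reward criterion. It is a minimal sketch, assuming a parameterized randomized policy and a simulator interface; the callables simulate_step and grad_log_policy, the eligibility-trace discount beta, and the specific update rule are illustrative assumptions and are not the formula or algorithms derived in this paper.

```python
import numpy as np

def smdp_gradient_estimate(simulate_step, grad_log_policy, theta,
                           num_transitions=100_000, beta=0.99, seed=0):
    """Single-sample-path gradient estimate for an SMDP under average reward.

    A generic likelihood-ratio / eligibility-trace estimator, shown only to
    illustrate the single-path structure; the paper's estimators are based on
    its own sensitivity formula and may differ.

    Assumed interfaces (hypothetical):
      simulate_step(state, theta, rng) -> (action, next_state, sojourn_time, reward)
      grad_log_policy(state, action, theta) -> gradient of log pi_theta(action | state)
    """
    rng = np.random.default_rng(seed)
    state = 0                        # assumed initial state label
    z = np.zeros_like(theta)         # eligibility trace of score functions
    grad = np.zeros_like(theta)      # running gradient estimate
    total_reward, total_time = 0.0, 0.0
    eta_hat = 0.0

    for k in range(num_transitions):
        action, next_state, tau, r = simulate_step(state, theta, rng)
        total_reward += r
        total_time += tau
        eta_hat = total_reward / max(total_time, 1e-12)  # running average-reward estimate

        # Discounted accumulation of score functions along the single path.
        z = beta * z + grad_log_policy(state, action, theta)

        # Running mean of the centered, sojourn-time-weighted reward times the trace.
        grad += ((r - eta_hat * tau) * z - grad) / (k + 1)

        state = next_state

    return grad, eta_hat
```

Because only the trace z, the running gradient, and two scalar accumulators are kept, an estimator organized this way needs storage proportional to the dimension of θ rather than to the length of the sample path.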
