Paper information - A basic formula for performance gradient estimation of semi-Markov decision processes

This paper presents a basic formula for performance gradient estimation of semi-Markov decision processes (SMDPs) under the average-reward criterion. The formula follows directly from a sensitivity equation in perturbation analysis. Using this formula, we develop three gradient estimation algorithms, each of which works from a single sample path. These algorithms naturally extend many gradient estimation algorithms for discrete-time Markov systems to continuous-time semi-Markov models. In particular, they require less storage than the existing algorithm in the literature.

Yanjie Li | Fang Cao

[1] Arnaud Doucet et al. A policy gradient method for semi-Markov decision processes with application to call admission control, 2007, Eur. J. Oper. Res.

[2] Ronald J. Williams et al. Simple Statistical Gradient-Following Algorithms for Connectionist Reinforcement Learning, 2004, Machine Learning.

[3] Dimitri P. Bertsekas et al. Dynamic Programming and Optimal Control, Two Volume Set, 1995.

[4] Xi-Ren Cao et al. On-Line Policy Gradient Estimation with Multi-Step Sampling, 2010, Discret. Event Dyn. Syst.

[5] Xi-Ren Cao et al. Perturbation analysis of discrete event dynamic systems, 1991.

[6] John N. Tsitsiklis et al. Simulation-based optimization of Markov reward processes, 2001, IEEE Trans. Autom. Control.

[7] Xi-Ren Cao et al. Algorithms for sensitivity analysis of Markov systems through potentials and perturbation realization, 1998, IEEE Trans. Control. Syst. Technol.

[8] Sheldon M. Ross et al. Stochastic Processes, 2018, Gauge Integral Structures for Stochastic Calculus and Quantum Electrodynamics.

[9] Martin L. Puterman et al. Markov Decision Processes: Discrete Stochastic Dynamic Programming, 1994.

[10] Xi-Ren Cao et al. Perturbation realization, potentials, and sensitivity analysis of Markov processes, 1997, IEEE Trans. Autom. Control.

[11] Xi-Ren Cao et al. Perturbation analysis and optimization of queueing networks, 1983.

[12] Vijay R. Konda et al. On Actor-Critic Algorithms, 2003, SIAM J. Control. Optim.

[13] Xi-Ren Cao et al. A single sample path-based performance sensitivity formula for Markov chains, 1996, IEEE Trans. Autom. Control.

[14] P. Glynn et al. Likelihood ratio gradient estimation for stochastic recursions, 1995, Advances in Applied Probability.

[15] Xi-Ren Cao et al. A basic formula for online policy gradient algorithms, 2005, IEEE Transactions on Automatic Control.

[16] Javier A. Barria et al. Reinforcement Learning for Resource Allocation in LEO Satellite Networks, 2007, IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics).

[17] Peter L. Bartlett et al. Infinite-Horizon Policy-Gradient Estimation, 2001, J. Artif. Intell. Res.

[18] Xi-Ren Cao et al. Stochastic learning and optimization - A sensitivity-based approach, 2007, Annu. Rev. Control.

[19] Andrew W. Moore et al. Gradient Descent for General Reinforcement Learning, 1998, NIPS.

[20] Peter W. Glynn et al. Likelihood ratio gradient estimation for stochastic systems, 1990, CACM.

[21] Xi-Ren Cao et al. Semi-Markov decision problems and performance sensitivity analysis, 2003, IEEE Trans. Autom. Control.