Semi-Markov decision problems and performance sensitivity analysis

Recent research indicates that Markov decision processes (MDPs) can be viewed from a sensitivity point of view, and that perturbation analysis (PA), MDPs, and reinforcement learning (RL) are three closely related areas in the optimization of discrete-event dynamic systems that can be modeled as Markov processes. The goal of this paper is twofold. First, we develop PA theory for semi-Markov processes (SMPs); we then extend the aforementioned results on the relations among PA, MDPs, and RL to SMPs. In particular, we show that performance sensitivity formulas and policy iteration algorithms for semi-Markov decision processes can be derived from the performance potential and the realization matrix. Both the long-run average-cost and discounted-cost problems are considered. This approach provides a unified framework for both problems, with the long-run average problem corresponding to a discount factor of zero. The results indicate that performance sensitivities and optimization depend only on first-order statistics. Implementations based on a single sample path are discussed.
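
As a concrete illustration of the potential-based approach described above, the following is a minimal sketch of policy iteration for a finite, average-cost semi-Markov decision process, with potentials obtained from the Poisson equation of the embedded chain. The function names, data layout (per-action transition matrices P, sojourn costs c, expected sojourn times tau), and the assumptions of a finite state space and an irreducible embedded chain are hypothetical and for illustration only; the paper's own derivation via the realization matrix is not reproduced here.

```python
import numpy as np

def evaluate(policy, P, c, tau):
    """Evaluate a stationary policy of a finite average-cost SMDP (sketch).

    P[a]     : |S| x |S| embedded-chain transition matrix under action a
    c[i, a]  : expected cost accumulated during one sojourn in state i under a
    tau[i, a]: expected sojourn time in state i under action a
    Returns the average cost eta and potentials g solving the Poisson equation
    g = c_d - eta * tau_d + P_d g, normalized so that pi @ g = 0.
    """
    S = len(policy)
    P_d = np.array([P[policy[i]][i] for i in range(S)])   # row i under the chosen action
    c_d = np.array([c[i, policy[i]] for i in range(S)])
    tau_d = np.array([tau[i, policy[i]] for i in range(S)])

    # Stationary distribution of the embedded chain (assumed irreducible).
    w, v = np.linalg.eig(P_d.T)
    pi = np.real(v[:, np.argmin(np.abs(w - 1.0))])
    pi /= pi.sum()

    # Ratio formula for the long-run average cost of an SMDP.
    eta = (pi @ c_d) / (pi @ tau_d)

    # Poisson equation made nonsingular by the rank-one correction e pi^T.
    A = np.eye(S) - P_d + np.outer(np.ones(S), pi)
    g = np.linalg.solve(A, c_d - eta * tau_d)
    return eta, g

def policy_iteration(P, c, tau, max_iter=100):
    """Potential-based policy iteration for the average-cost SMDP (sketch)."""
    S, A = c.shape
    policy = np.zeros(S, dtype=int)
    for _ in range(max_iter):
        eta, g = evaluate(policy, P, c, tau)
        # Improvement step: greedy in c(i,a) - eta*tau(i,a) + sum_j p(j|i,a) g(j).
        Q = np.array([[c[i, a] - eta * tau[i, a] + P[a][i] @ g for a in range(A)]
                      for i in range(S)])
        new_policy = Q.argmin(axis=1)
        if np.array_equal(new_policy, policy):
            break
        policy = new_policy
    return policy, eta
```

The evaluation step uses only first-order statistics of the model (transition probabilities, expected sojourn costs and times), consistent with the claim in the abstract; a sample-path-based variant would estimate eta and g from a single trajectory instead of solving the linear system.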
