Policy Gradient Semi-Markov Decision Process

This paper proposes a simulation-based algorithm for optimizing the average reward in a parameterized continuous-time, finite-state semi-Markov decision process (SMDP). Our contributions are twofold. First, we derive an approximate gradient of the average reward with respect to the parameters of an SMDP controlled by a parameterized stochastic policy; a stochastic gradient ascent method then adjusts the parameters to improve the average reward. Second, we present a simulation-based algorithm (GSMDP) that estimates this approximate gradient of the average reward using only a single sample path of the underlying Markov chain. We prove that this estimate converges almost surely to the true gradient of the average reward as the number of iterations goes to infinity.
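The abstract gives no pseudocode, so the following is a minimal Python sketch of the general approach under stated assumptions, not the paper's actual GSMDP algorithm: a hypothetical toy finite SMDP with exponential holding times, a tabular softmax policy, a GPOMDP-style discounted eligibility trace (beta < 1), and a quotient-rule gradient estimate motivated by the renewal-reward identity that the SMDP average reward equals expected reward per transition divided by expected sojourn time. All names, constants, and the estimator itself are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# --- Toy SMDP (all numbers hypothetical, for illustration only) --------
N_STATES, N_ACTIONS = 4, 2
P = rng.dirichlet(np.ones(N_STATES), size=(N_STATES, N_ACTIONS))  # P[s, a]: next-state distribution
R = rng.uniform(0.0, 1.0, size=(N_STATES, N_ACTIONS))             # lump-sum reward per transition
RATE = rng.uniform(0.5, 2.0, size=(N_STATES, N_ACTIONS))          # exponential sojourn-time rates


def policy(theta, s):
    """Tabular softmax policy pi(. | s; theta)."""
    z = theta[s] - theta[s].max()
    p = np.exp(z)
    return p / p.sum()


def grad_log_pi(theta, s, a):
    """grad_theta log pi(a | s; theta); nonzero only in row s."""
    g = np.zeros_like(theta)
    g[s] = -policy(theta, s)
    g[s, a] += 1.0
    return g


def estimate_gradient(theta, T=5_000, beta=0.95):
    """Single-sample-path estimate of grad rho(theta), rho = E[reward] / E[sojourn].

    The discount beta < 1 on the eligibility trace trades bias for variance,
    as in GPOMDP (Baxter & Bartlett, 2001); the quotient rule combines the
    gradient estimates of the per-transition reward and sojourn-time means.
    """
    s = 0
    z = np.zeros_like(theta)       # eligibility trace
    g_r = np.zeros_like(theta)     # gradient estimate of E[reward per transition]
    g_tau = np.zeros_like(theta)   # gradient estimate of E[sojourn time]
    sum_r = sum_tau = 0.0
    for _ in range(T):
        a = rng.choice(N_ACTIONS, p=policy(theta, s))
        tau = rng.exponential(1.0 / RATE[s, a])   # random holding time in state s
        z = beta * z + grad_log_pi(theta, s, a)
        g_r += R[s, a] * z
        g_tau += tau * z
        sum_r += R[s, a]
        sum_tau += tau
        s = rng.choice(N_STATES, p=P[s, a])       # jump to the next state
    r_bar, tau_bar = sum_r / T, sum_tau / T
    grad_rho = (g_r / T * tau_bar - r_bar * g_tau / T) / tau_bar**2
    return grad_rho, r_bar / tau_bar              # gradient estimate, average reward


# --- Stochastic gradient ascent on the average reward ------------------
theta = np.zeros((N_STATES, N_ACTIONS))
for k in range(30):
    grad, rho = estimate_gradient(theta)
    theta += 0.5 * grad                           # fixed step size, for simplicity
print(f"estimated average reward after ascent: {rho:.3f}")
```

The quotient rule here reflects that in an SMDP the policy parameters affect both the rewards collected and the time spent collecting them, so both terms must be differentiated; a plain MDP estimator would drop the sojourn-time term.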
