A policy gradient method for SMDPs with application to call admission control

Classical methods for solving a semi-Markov decision process (SMDP), such as value iteration and policy iteration, require precise knowledge of the underlying probabilistic model and are known to suffer from the curse of dimensionality. To overcome both limitations, this paper presents a reinforcement learning approach in which the performance criterion is optimised directly over a family of parameterised policies. We propose an online algorithm that simultaneously estimates the gradient of the performance criterion and optimises it through stochastic approximation. The gradient estimator is based on the discounted score method. We demonstrate the utility of our algorithm on a call admission control problem.
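
To make the idea concrete, here is a minimal sketch, not the paper's exact algorithm, of an online discounted-score policy-gradient update on a toy call admission control problem. The Bernoulli accept/reject policy, the link dynamics (capacity `C`, departure rate), the rewards, and all parameter values are illustrative assumptions; the key ingredients from the abstract are the discounted score accumulator `z` and the stochastic-approximation step on the policy parameter.

```python
import math
import random

random.seed(0)


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def run(theta=0.0, beta=0.9, step=0.01, steps=20000):
    """Online loop: theta parameterises the accept probability,
    z is the discounted score (eligibility trace), and theta is
    updated by stochastic approximation from reward * z.

    Toy dynamics (assumed, for illustration only): accepting a call
    earns +1 unless the link is at capacity C, in which case a
    penalty is incurred; calls depart at a fixed rate."""
    C, load = 5, 0
    z = 0.0
    for _ in range(steps):
        p = sigmoid(theta)                 # probability of accepting a call
        accept = random.random() < p
        # score = d/dtheta log pi(action | theta) for a Bernoulli policy
        score = (1.0 - p) if accept else -p
        z = beta * z + score               # discounted score accumulator
        if accept:
            reward = 1.0 if load < C else -5.0   # overload penalty
            load = min(C, load + 1)
        else:
            reward = 0.0
        if load > 0 and random.random() < 0.3:   # random call departure
            load -= 1
        theta += step * reward * z         # stochastic-approximation update
    return theta


theta = run()
print(sigmoid(theta))  # learned acceptance probability
```

The discount factor `beta` trades bias against variance in the gradient estimate: values closer to 1 give a less biased but noisier estimator, mirroring the role of the discount in the score method the abstract refers to.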