A policy gradient method for SMDPs with application to call admission control

Classical methods for solving a semi-Markov decision process (SMDP), such as value iteration and policy iteration, require precise knowledge of the underlying probabilistic model and are known to suffer from the curse of dimensionality. To overcome both limitations, this paper presents a reinforcement learning approach in which the performance criterion is optimised directly over a family of parameterised policies. We propose an online algorithm that simultaneously estimates the gradient of the performance criterion and optimises it through stochastic approximation. The gradient estimator is based on the discounted score method. We demonstrate the utility of our algorithm on a call admission control problem.
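
To make the idea concrete, here is a minimal sketch, not the paper's exact algorithm, of an online discounted-score policy-gradient update on a toy call admission control problem. The Bernoulli accept/reject policy, the link dynamics (capacity `C`, departure rate), the rewards, and all parameter values are illustrative assumptions; the key ingredients from the abstract are the discounted score accumulator `z` and the stochastic-approximation step on the policy parameter.

```python
import math
import random

random.seed(0)


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def run(theta=0.0, beta=0.9, step=0.01, steps=20000):
    """Online loop: theta parameterises the accept probability,
    z is the discounted score (eligibility trace), and theta is
    updated by stochastic approximation from reward * z.

    Toy dynamics (assumed, for illustration only): accepting a call
    earns +1 unless the link is at capacity C, in which case a
    penalty is incurred; calls depart at a fixed rate."""
    C, load = 5, 0
    z = 0.0
    for _ in range(steps):
        p = sigmoid(theta)                 # probability of accepting a call
        accept = random.random() < p
        # score = d/dtheta log pi(action | theta) for a Bernoulli policy
        score = (1.0 - p) if accept else -p
        z = beta * z + score               # discounted score accumulator
        if accept:
            reward = 1.0 if load < C else -5.0   # overload penalty
            load = min(C, load + 1)
        else:
            reward = 0.0
        if load > 0 and random.random() < 0.3:   # random call departure
            load -= 1
        theta += step * reward * z         # stochastic-approximation update
    return theta


theta = run()
print(sigmoid(theta))  # learned acceptance probability
```

The discount factor `beta` trades bias against variance in the gradient estimate: values closer to 1 give a less biased but noisier estimator, mirroring the role of the discount in the score method the abstract refers to.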