Simulation-Based Methods for Markov Decision Processes

Markov decision processes have been a popular paradigm for sequential decision making under uncertainty. Dynamic programming provides a framework for studying such problems, as well as for devising algorithms to compute an optimal control policy. Dynamic programming methods rely on a suitably defined value function that has to be computed for every state in the state space. However, many interesting problems involve very large state spaces (the "curse of dimensionality"), which prohibits the application of dynamic programming. In addition, dynamic programming assumes the availability of an exact model, in the form of transition probabilities (the "curse of modeling"). In many practical situations, such a model is not available and one must resort to simulation or experimentation with an actual system. For all of these reasons, dynamic programming in its pure form may be inapplicable.

In this thesis we study an approach for overcoming these difficulties in which we use (a) compact (parametric) representations of the control policy, thus avoiding the curse of dimensionality, and (b) simulation to estimate quantities of interest, thus avoiding model-based computations (the curse of modeling). The approach is not limited to Markov decision processes, but applies to general Markov reward processes for which the transition probabilities and the one-stage rewards depend on a parameter vector θ. We propose gradient-type algorithms for updating θ based on the simulation of a single sample path, so as to improve a given performance measure. As possible performance measures we consider the weighted reward-to-go and the average reward. The corresponding algorithms (a) can be implemented online and update the parameter vector either at visits to a certain state or at every time step, and (b) have the property that the gradient (with respect to θ) of the performance measure converges to 0 with probability 1. This is the strongest possible result for gradient-related stochastic approximation algorithms.
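To make the idea of estimating a gradient from a single simulated sample path concrete, here is a minimal Python sketch. It is not the algorithm of the thesis: it is a batch likelihood-ratio (score function) estimate over regenerative cycles, run at a fixed θ, for a hypothetical two-state chain. The chain, the sigmoid parameterization, the reward values, and the function names `estimate_gradient` and `exact_gradient` are all assumptions made for illustration.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def exact_gradient(theta, q=0.5):
    """Closed-form derivative of the average reward for the toy chain below."""
    p = sigmoid(theta)
    return q * p * (1.0 - p) / (p + q) ** 2


def estimate_gradient(theta, num_cycles=20000, q=0.5, seed=0):
    """Estimate the average reward and its derivative in theta from one sample path.

    Toy chain (an assumption, not taken from the thesis): from state 0 the
    chain moves to state 1 with probability p = sigmoid(theta) and stays at 0
    otherwise; from state 1 it returns to 0 with fixed probability q. The
    reward is 1 in state 1 and 0 in state 0. State 0 is the regeneration state.
    """
    rng = random.Random(seed)
    p = sigmoid(theta)
    total_time = 0
    total_reward = 0.0
    sum_reward_score = 0.0  # sum over steps of reward * accumulated score
    sum_score = 0.0         # sum over steps of accumulated score
    for _ in range(num_cycles):
        # Every cycle starts with one step in state 0 (reward 0, score 0).
        total_time += 1
        if rng.random() < p:
            # Transition 0 -> 1: score is d/dtheta log sigmoid(theta) = 1 - p.
            score = 1.0 - p
            in_state_1 = True
            while in_state_1:
                total_time += 1
                total_reward += 1.0
                sum_reward_score += 1.0 * score
                sum_score += score
                # Leave state 1 with probability q (independent of theta).
                in_state_1 = rng.random() >= q
        # A 0 -> 0 transition ends the cycle at once and contributes no score.
    avg_reward = total_reward / total_time
    # Regenerative estimate: sum of (reward - average) * score, per unit time.
    grad = (sum_reward_score - avg_reward * sum_score) / total_time
    return avg_reward, grad
```

A gradient-type method would then move θ a small step in the direction of the estimate, for example `theta += step_size * grad`. The online algorithms described above update during the simulation rather than after a batch of cycles; this sketch separates estimation from updating only to keep the example short.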
We illustrate the methodology by considering the call admission control problem, in which a telecommunications provider sells the bandwidth of a single communication link to customers so as to maximize revenue. We use the proposed algorithms to optimize the parameters of a heuristic threshold policy for this problem.
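As an illustration of what evaluating a threshold policy by simulation might look like, the following Python sketch simulates a single link with two call classes. The abstract does not specify the model, so everything here is an assumption: the arrival rates, holding times, bandwidth demands, revenues, the rule that revenue is collected when a call is accepted, and the particular form of the threshold rule (a class-k call is accepted only if the bandwidth in use after acceptance stays within that class's threshold).

```python
import random


def simulate_threshold_policy(thresholds, capacity=10, horizon=2000.0, seed=0):
    """Return (average revenue per unit time, peak bandwidth in use)."""
    # Hypothetical classes: (arrival rate, mean holding time, bandwidth, revenue per call).
    classes = [(1.0, 1.0, 1, 1.0), (0.5, 1.0, 3, 5.0)]
    rng = random.Random(seed)
    active = [0] * len(classes)  # calls in progress, per class
    used = 0                     # bandwidth currently in use
    peak = 0
    revenue = 0.0
    now = 0.0
    while True:
        # Competing exponential events: one arrival stream and one
        # departure stream per class.
        rates = [lam for lam, _, _, _ in classes]
        rates += [n / hold for n, (_, hold, _, _) in zip(active, classes)]
        now += rng.expovariate(sum(rates))
        if now >= horizon:
            break
        event = rng.choices(range(len(rates)), weights=rates)[0]
        if event < len(classes):
            # Arrival: apply the threshold rule, never exceeding capacity.
            _, _, bandwidth, reward = classes[event]
            limit = min(thresholds[event], capacity)
            if used + bandwidth <= limit:
                active[event] += 1
                used += bandwidth
                revenue += reward
                peak = max(peak, used)
        else:
            # Departure: release the bandwidth held by one call of this class.
            k = event - len(classes)
            if active[k] > 0:
                active[k] -= 1
                used -= classes[k][2]
    return revenue / horizon, peak
```

For example, `simulate_threshold_policy([7, 10])` reserves some of the link for the wider, higher-revenue class by refusing narrow calls once 7 units are in use. A hard threshold like this one is not differentiable in its parameters, so a gradient-type method would presumably need a smoothed or randomized version of the rule; that step is left out of this sketch.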