Self Learning Control of Constrained Markov Decision Processes - A Gradient Approach

We present stochastic approximation algorithms for computing the locally optimal policy of a constrained, average-cost, finite-state Markov decision process. Because the optimal control strategy is known to be a randomized policy, we parameterize the action probabilities to formulate the optimization problem. The stochastic approximation algorithms require the gradient of the cost function with respect to the parameter that characterizes the randomized policy; this gradient is computed by novel simulation-based estimation schemes involving weak derivatives. Like neuro-dynamic programming algorithms (e.g., Q-learning or temporal-difference methods), the algorithms proposed in this paper are simulation based and do not require explicit knowledge of the underlying parameters, such as the transition probabilities. Unlike neuro-dynamic programming methods, however, the algorithms proposed here can handle constraints and time-varying parameters. Numerical examples illustrate the performance of the algorithms.

Résumé: We consider the problem of optimally controlling a Markov decision process (MDP) subject to constraints. By randomizing the actions, the problem is parameterized and can be rewritten as a nonlinear optimization problem with nonlinear constraints, to which we apply stochastic approximation. This requires computing certain gradients with respect to the control parameters. We propose a new method for estimating such gradients using weak derivatives. Our method is robust (like the Q-learning and temporal-difference algorithms) and does not assume that the transition probabilities are known. Moreover, unlike the other methods mentioned, our method can be applied when constraints are present. We also present numerical examples to illustrate the performance of our algorithms.

Acknowledgments: This work was done while the first author was on leave at the Department of Electrical and Electronic Engineering at the University of Melbourne. The research was supported by the Australian Research Council and by research grants from NSERC, Canada, and FCAR, Quebec.
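To make the approach concrete, the following minimal Python sketch shows a primal-dual stochastic approximation for a hypothetical two-state, two-action constrained MDP with a parameterized randomized policy. Everything here is assumed for illustration: the transition matrices P, costs c and d, constraint level beta, and the softmax parameterization are placeholders, and a GPOMDP-style likelihood-ratio gradient estimator stands in for the paper's weak-derivative scheme.

```python
import numpy as np

# Hypothetical 2-state, 2-action constrained MDP; all numbers are
# illustrative placeholders, not values from the paper.
P = [np.array([[0.9, 0.1],      # P[a][s, s']: transition probs under action a
               [0.2, 0.8]]),
     np.array([[0.3, 0.7],
               [0.6, 0.4]])]
c = np.array([[1.0, 2.0],       # c[s, a]: stage cost to be minimized
              [0.5, 1.5]])
d = np.array([[0.2, 1.0],       # d[s, a]: constraint cost, want avg d <= beta
              [0.8, 0.1]])
beta = 0.6

def policy(theta, s):
    """Softmax parameterization of the randomized policy pi_theta(. | s)."""
    z = np.exp(theta[s] - theta[s].max())
    return z / z.sum()

rng = np.random.default_rng(0)
theta = np.zeros((2, 2))        # one parameter vector per state
lam = 0.0                       # Lagrange multiplier for the constraint
trace = np.zeros_like(theta)    # eligibility trace of the gradient estimator
gamma = 0.99                    # trace-discount factor
s = 0
for n in range(1, 200_001):
    probs = policy(theta, s)
    a = rng.choice(2, p=probs)
    # Lagrangian stage cost: minimize c subject to average d <= beta.
    stage = c[s, a] + lam * (d[s, a] - beta)
    # grad_theta log pi_theta(a | s) for the softmax policy.
    score = -probs
    score[a] += 1.0
    trace *= gamma
    trace[s] += score
    step = 0.5 / n              # decreasing step size (stochastic approximation)
    theta -= step * stage * trace                   # descent in theta
    lam = max(0.0, lam + step * (d[s, a] - beta))   # ascent in lambda
    s = rng.choice(2, p=P[a][s])
```

In practice the multiplier update would typically run on a slower timescale than the policy update, and the paper's weak-derivative estimator would replace the likelihood-ratio trace used in this sketch.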
