An Online Actor–Critic Algorithm with Function Approximation for Constrained Markov Decision Processes

We develop an online actor–critic reinforcement learning algorithm with function approximation for control problems with inequality constraints. We consider the long-run average cost Markov decision process (MDP) framework, in which both the objective and the constraint functions are suitable policy-dependent long-run averages of certain sample path functions. The Lagrange multiplier method is used to handle the inequality constraints. We prove the asymptotic almost sure convergence of our algorithm to a locally optimal solution. We also report numerical experiments on a problem of routing in a multi-stage queueing network with constraints on long-run average queue lengths. We observe that our algorithm exhibits good performance in this setting and converges to a feasible point.
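To fix ideas, the following is a minimal sketch of the Lagrangian relaxation that the multiplier method refers to; the symbols $J(\theta)$, $G_i(\theta)$, $c_i$, $\lambda_i$, and the step sizes $b_n$ are illustrative notation and need not match the paper's. For a policy parameterized by $\theta$, with long-run average cost $J(\theta)$ and long-run average constraint functions required to satisfy $G_i(\theta) \le c_i$, $i = 1,\dots,m$, one forms the Lagrangian

\[
  L(\theta,\lambda) \;=\; J(\theta) \;+\; \sum_{i=1}^{m} \lambda_i \bigl( G_i(\theta) - c_i \bigr),
  \qquad \lambda_i \ge 0,
\]

and couples (approximate) gradient descent in $\theta$ with projected ascent in the multipliers, e.g.

\[
  \lambda_i \;\leftarrow\; \max\!\bigl( 0,\; \lambda_i + b_n \bigl( G_i(\theta) - c_i \bigr) \bigr).
\]

In two-timescale actor–critic schemes of this kind, the critic typically estimates the required averages on the fastest timescale, the actor updates $\theta$ on an intermediate timescale, and the multiplier recursion is the slowest; this arrangement is a common one and is stated here only as an assumption about the general setup, not as the paper's exact algorithm.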
