A novel Q-learning algorithm with function approximation for constrained Markov decision processes

We present a novel multi-timescale Q-learning algorithm for average cost control in a Markov decision process subject to multiple inequality constraints. We formulate a relaxed version of this problem through the Lagrange multiplier method. Our algorithm differs from standard Q-learning in that it updates two parameters: a Q-value parameter and a policy parameter. The Q-value parameter is updated on a slower timescale than the policy parameter. Whereas Q-learning with function approximation can diverge in some cases, our algorithm is seen to converge as a result of this timescale separation. We report experimental results on a problem of constrained routing in a multistage queueing network. The algorithm exhibits good performance, and the various inequality constraints are satisfied upon its convergence.
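To make the update structure concrete, the following is a minimal Python sketch of a two-timescale, Lagrangian-relaxed update of the kind described above, and not the paper's exact algorithm. It assumes linear function approximation with state-action features, a Gibbs (softmax) policy, a single inequality constraint, and toy costs and transitions; the feature map, step-size schedules, policy-update direction, and the placement of the Lagrange multiplier on the slowest timescale are all illustrative assumptions.

```python
# Minimal sketch (illustrative assumptions throughout): two-timescale updates
# for a Lagrangian-relaxed, average-cost constrained MDP with linear
# function approximation. Not the paper's exact algorithm.
import numpy as np

rng = np.random.default_rng(0)

n_states, n_actions, d = 5, 2, 8                     # assumed toy problem sizes
phi = rng.normal(size=(n_states, n_actions, d))      # assumed state-action features

w = np.zeros(d)       # Q-value parameter (slower timescale)
theta = np.zeros(d)   # policy parameter (faster timescale)
lam = 0.0             # Lagrange multiplier for the single assumed constraint
rho = 0.0             # running estimate of the average Lagrangian cost

def cost(s, a):                 # assumed single-stage cost
    return float(s == 0) + 0.1 * a

def constraint_cost(s, a):      # assumed single-stage constraint cost
    return float(a == 1)

C = 0.3                         # assumed constraint bound: long-run average <= C

def next_state(s, a):           # assumed toy transition kernel
    return rng.integers(n_states)

def policy(s):
    # Gibbs (softmax) policy over actions, parameterized by theta
    prefs = phi[s] @ theta
    p = np.exp(prefs - prefs.max())
    p /= p.sum()
    return rng.choice(n_actions, p=p), p

s = 0
for n in range(1, 100_000):
    b_n = 1.0 / n ** 0.6                 # faster step size (policy parameter)
    a_n = 1.0 / n                        # slower step size (Q-value parameter)
    c_n = 1.0 / (n * np.log(n + 2))      # slowest step size (Lagrange multiplier)

    a, p = policy(s)
    s_next = next_state(s, a)
    g = cost(s, a) + lam * (constraint_cost(s, a) - C)   # Lagrangian single-stage cost

    # Temporal-difference error for the average-cost Lagrangian problem
    q_sa = phi[s, a] @ w
    q_next = max(phi[s_next, b] @ w for b in range(n_actions))
    delta = g - rho + q_next - q_sa

    rho += a_n * (g - rho)               # average-cost estimate
    w += a_n * delta * phi[s, a]         # Q-value parameter: slower timescale

    # Policy parameter: faster timescale, moved along an (assumed)
    # score-function direction that decreases the Lagrangian cost
    score = phi[s, a] - p @ phi[s]
    theta -= b_n * q_sa * score

    # Lagrange multiplier ascent on the slowest timescale, kept nonnegative
    lam = max(0.0, lam + c_n * (constraint_cost(s, a) - C))

    s = s_next

print(f"average Lagrangian cost estimate: {rho:.3f}, multiplier: {lam:.3f}")
```

The step sizes above are chosen so that the policy parameter sees the largest steps, the Q-value parameter smaller ones, and the Lagrange multiplier the smallest, reflecting the timescale ordering sketched in the abstract; the specific schedules are assumptions for illustration only.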
