论文信息 - A novel Q-learning algorithm with function approximation for constrained Markov decision processes

A novel Q-learning algorithm with function approximation for constrained Markov decision processes

We present a novel multi-timescale Q-learning algorithm for average cost control in a Markov decision process subject to multiple inequality constraints. We formulate a relaxed version of this problem through the Lagrange multiplier method. Our algorithm is different from Q-learning in that it updates two parameters - a Q-value parameter and a policy parameter. The Q-value parameter is updated on a slower time scale as compared to the policy parameter. Whereas Q-learning with function approximation can diverge in some cases, our algorithm is seen to be convergent as a result of the aforementioned timescale separation. We show the results of experiments on a problem of constrained routing in a multistage queueing network. Our algorithm is seen to exhibit good performance and the various inequality constraints are seen to be satisfied upon convergence of the algorithm.

Shalabh Bhatnagar | K. Lakshmanan | S. Bhatnagar | K. Lakshmanan

[1] Benjamin Van Roy,et al. Average cost temporal-difference learning , 1997, Proceedings of the 36th IEEE Conference on Decision and Control.

[2] Vivek S. Borkar,et al. Learning Algorithms for Markov Decision Processes with Average Cost , 2001, SIAM J. Control. Optim..

[3] Michael C. Fu,et al. Two-timescale simultaneous perturbation stochastic approximation using deterministic perturbation sequences , 2003, TOMC.

[4] Jean C. Walrand,et al. An introduction to queueing networks , 1989, Prentice Hall International editions.

[5] A. Mas-Colell,et al. Microeconomic Theory , 1995 .

[6] E. Altman. Constrained Markov Decision Processes , 1999 .

[7] John N. Tsitsiklis,et al. Analysis of temporal-difference learning with function approximation , 1996, NIPS 1996.

[8] Sean P. Meyn,et al. The O.D.E. Method for Convergence of Stochastic Approximation and Reinforcement Learning , 2000, SIAM J. Control. Optim..

[9] An Online Convergent Q-learning Algorithm with Linear Function Approximation , 2011 .

[10] John N. Tsitsiklis,et al. Average cost temporal-difference learning , 1997, Proceedings of the 36th IEEE Conference on Decision and Control.

[11] Francisco S. Melo,et al. Q -Learning with Linear Function Approximation , 2007, COLT.

[12] Dimitri P. Bertsekas,et al. Dynamic Programming and Optimal Control, Two Volume Set , 1995 .

[13] Dimitri P. Bertsekas,et al. Approximate Dynamic Programming , 2017, Encyclopedia of Machine Learning and Data Mining.

[14] Leemon C. Baird,et al. Residual Algorithms: Reinforcement Learning with Function Approximation , 1995, ICML.

[15] Shalabh Bhatnagar,et al. An Online Actor–Critic Algorithm with Function Approximation for Constrained Markov Decision Processes , 2012, J. Optim. Theory Appl..

[16] Richard S. Sutton,et al. Reinforcement Learning: An Introduction , 1998, IEEE Trans. Neural Networks.

[17] Dimitri P. Bertsekas,et al. Dynamic Programming and Optimal Control, Vol. II , 1976 .

[18] Peter Dayan,et al. Q-learning , 1992, Machine Learning.

[19] Martin L. Puterman,et al. Markov Decision Processes: Discrete Stochastic Dynamic Programming , 1994 .

[20] Shalabh Bhatnagar,et al. Natural actor-critic algorithms , 2009, Autom..

[21] James C. Spall,et al. A one-measurement form of simultaneous perturbation stochastic approximation , 1997, Autom..

[22] Shalabh Bhatnagar,et al. An actor-critic algorithm with function approximation for discounted cost constrained Markov decision processes , 2010, Syst. Control. Lett..

[23] Vivek S. Borkar,et al. An actor-critic algorithm for constrained Markov decision processes , 2005, Syst. Control. Lett..

[24] Richard S. Sutton,et al. Learning to predict by the methods of temporal differences , 1988, Machine Learning.