Fast Global Convergence of Policy Optimization for Constrained MDPs

We address the issue of safety in reinforcement learning. We pose the problem in a discounted infinite-horizon constrained Markov decision process (CMDP) framework. Existing results have shown that gradient-based methods can achieve an O(1/√T) global convergence rate for both the optimality gap and the constraint violation. We exhibit a natural policy gradient-based algorithm with a faster O(log(T)/T) convergence rate for both the optimality gap and the constraint violation. When Slater's condition is satisfied and known a priori, zero constraint violation can further be guaranteed for sufficiently large T while the same convergence rate for the optimality gap is maintained.
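
To make the setting concrete, below is a minimal sketch of a natural policy gradient primal-dual iteration on a small tabular CMDP (maximize the reward value subject to a utility value being at least b), assuming exact policy evaluation is available. The function names, step sizes, and update schedule (npg_primal_dual, eta_pi, eta_lam, T) are illustrative choices for exposition, not the paper's exact algorithm or constants.

```python
# Minimal sketch: natural policy gradient (NPG) primal-dual updates on a
# tabular constrained MDP, assuming exact policy evaluation.
# Problem: maximize V_r(rho) subject to V_g(rho) >= b.
# All names and step sizes here are illustrative, not the paper's algorithm.
import numpy as np

def policy_eval(P, R, pi, gamma):
    """Exact evaluation of policy pi for reward table R (S x A).
    P is the S x A x S transition tensor. Returns (V, Q)."""
    S, A = R.shape
    P_pi = np.einsum('sap,sa->sp', P, pi)      # state-to-state kernel under pi
    r_pi = np.einsum('sa,sa->s', R, pi)        # expected one-step reward under pi
    V = np.linalg.solve(np.eye(S) - gamma * P_pi, r_pi)
    Q = R + gamma * np.einsum('sap,p->sa', P, V)
    return V, Q

def npg_primal_dual(P, R, G, b, rho, gamma=0.9, eta_pi=1.0, eta_lam=0.1, T=200):
    """R: reward table, G: utility table, b: constraint level, rho: initial distribution."""
    S, A = R.shape
    pi = np.full((S, A), 1.0 / A)              # start from the uniform policy
    lam = 0.0                                  # dual variable, kept nonnegative
    for _ in range(T):
        _, Q_r = policy_eval(P, R, pi, gamma)
        V_g, Q_g = policy_eval(P, G, pi, gamma)
        # Primal step: multiplicative-weights / softmax NPG update on the
        # Lagrangian Q-function Q_r + lam * Q_g.
        logits = np.log(pi) + eta_pi * (Q_r + lam * Q_g)
        logits -= logits.max(axis=1, keepdims=True)
        pi = np.exp(logits)
        pi /= pi.sum(axis=1, keepdims=True)
        # Dual step: projected (sub)gradient descent on lambda;
        # lambda grows when the constraint V_g(rho) >= b is violated.
        lam = max(0.0, lam - eta_lam * (rho @ V_g - b))
    return pi, lam
```

The dual variable trades off reward maximization against constraint satisfaction: the primal step improves the Lagrangian value for the current lam, while the dual step adjusts lam toward the constraint boundary. Step sizes would need tuning for any particular instance; this sketch only illustrates the primal-dual structure the abstract refers to.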
