Fast Global Convergence of Policy Optimization for Constrained MDPs

We address the issue of safety in reinforcement learning. We pose the problem in a discounted infinite-horizon constrained Markov decision process (CMDP) framework. Existing results have shown that gradient-based methods can achieve an O(1/√T) global convergence rate for both the optimality gap and the constraint violation. We exhibit a natural policy gradient-based algorithm with a faster O(log(T)/T) convergence rate for both the optimality gap and the constraint violation. When Slater's condition is satisfied and known a priori, zero constraint violation can further be guaranteed for sufficiently large T while the same convergence rate for the optimality gap is maintained.
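
To make the setting concrete, below is a minimal sketch of a natural policy gradient primal-dual iteration on a small tabular CMDP (maximize the reward value subject to a utility value being at least b), assuming exact policy evaluation is available. The function names, step sizes, and update schedule (npg_primal_dual, eta_pi, eta_lam, T) are illustrative choices for exposition, not the paper's exact algorithm or constants.

```python
# Minimal sketch: natural policy gradient (NPG) primal-dual updates on a
# tabular constrained MDP, assuming exact policy evaluation.
# Problem: maximize V_r(rho) subject to V_g(rho) >= b.
# All names and step sizes here are illustrative, not the paper's algorithm.
import numpy as np

def policy_eval(P, R, pi, gamma):
    """Exact evaluation of policy pi for reward table R (S x A).
    P is the S x A x S transition tensor. Returns (V, Q)."""
    S, A = R.shape
    P_pi = np.einsum('sap,sa->sp', P, pi)      # state-to-state kernel under pi
    r_pi = np.einsum('sa,sa->s', R, pi)        # expected one-step reward under pi
    V = np.linalg.solve(np.eye(S) - gamma * P_pi, r_pi)
    Q = R + gamma * np.einsum('sap,p->sa', P, V)
    return V, Q

def npg_primal_dual(P, R, G, b, rho, gamma=0.9, eta_pi=1.0, eta_lam=0.1, T=200):
    """R: reward table, G: utility table, b: constraint level, rho: initial distribution."""
    S, A = R.shape
    pi = np.full((S, A), 1.0 / A)              # start from the uniform policy
    lam = 0.0                                  # dual variable, kept nonnegative
    for _ in range(T):
        _, Q_r = policy_eval(P, R, pi, gamma)
        V_g, Q_g = policy_eval(P, G, pi, gamma)
        # Primal step: multiplicative-weights / softmax NPG update on the
        # Lagrangian Q-function Q_r + lam * Q_g.
        logits = np.log(pi) + eta_pi * (Q_r + lam * Q_g)
        logits -= logits.max(axis=1, keepdims=True)
        pi = np.exp(logits)
        pi /= pi.sum(axis=1, keepdims=True)
        # Dual step: projected (sub)gradient descent on lambda;
        # lambda grows when the constraint V_g(rho) >= b is violated.
        lam = max(0.0, lam - eta_lam * (rho @ V_g - b))
    return pi, lam
```

The dual variable trades off reward maximization against constraint satisfaction: the primal step improves the Lagrangian value for the current lam, while the dual step adjusts lam toward the constraint boundary. Step sizes would need tuning for any particular instance; this sketch only illustrates the primal-dual structure the abstract refers to.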
