Projection-Based Constrained Policy Optimization

In this paper, we consider the problem of learning control policies that optimize a reward function while satisfying constraints due to considerations of safety, fairness, or other costs. We propose a new algorithm, Projection-Based Constrained Policy Optimization (PCPO), an iterative method for optimizing policies in a two-step process: the first step performs an unconstrained update, while the second step reconciles the constraint violation by projecting the policy back onto the constraint set. We theoretically analyze PCPO and provide a lower bound on reward improvement, as well as an upper bound on constraint violation, for each policy update. We further characterize the convergence of PCPO for projections based on two different metrics: the L2 norm and the Kullback-Leibler divergence. Our empirical results over several control tasks demonstrate that our algorithm achieves superior performance, averaging more than 3.5 times less constraint violation and around 15% higher reward compared to state-of-the-art methods.
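As a concrete illustration of the two-step update described above, the sketch below shows one PCPO-style iteration in NumPy under simplifying assumptions: the reward step is a TRPO-like natural-gradient step with trust-region size delta, the constraint is linearized around the current parameters, and the reward gradient g, cost gradient a, constraint slack c, and Fisher matrix H are assumed to be supplied by the caller. This is an illustrative sketch, not the paper's implementation.

import numpy as np

def pcpo_update(theta, g, a, c, H, delta, projection="l2"):
    """One illustrative PCPO-style iteration on policy parameters theta.

    theta      : current policy parameters, shape (n,)
    g          : gradient of the reward objective at theta, shape (n,)
    a          : gradient of the cost constraint at theta, shape (n,)
    c          : constraint slack J_C(theta) - d (positive means violated)
    H          : Fisher information matrix at theta, shape (n, n)
    delta      : trust-region size for the reward step
    projection : "l2" or "kl", the metric used in the projection step
    """
    H_inv = np.linalg.inv(H)

    # Step 1: unconstrained reward improvement inside a KL trust region
    # (a TRPO-like natural-gradient step).
    step_dir = H_inv @ g
    theta_mid = theta + np.sqrt(2.0 * delta / (g @ step_dir)) * step_dir

    # Step 2: project theta_mid back onto the linearized constraint set
    # {theta' : c + a^T (theta' - theta) <= 0}.
    violation = c + a @ (theta_mid - theta)
    if violation <= 0.0:
        return theta_mid  # already feasible; no correction needed
    if projection == "l2":
        # Euclidean (L2-norm) projection onto the constraint half-space.
        return theta_mid - (violation / (a @ a)) * a
    # KL-divergence projection, i.e. projection under the Fisher metric.
    return theta_mid - (violation / (a @ H_inv @ a)) * (H_inv @ a)

The only difference between the two projections is the metric: the L2 case scales the correction by a^T a and moves along a, while the KL case uses the Fisher-weighted inner product a^T H^{-1} a and moves along H^{-1} a.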
