Constrained Policy Optimization

For many applications of reinforcement learning it can be more convenient to specify both a reward function and constraints, rather than trying to design behavior through the reward function. For example, systems that physically interact with or around humans should satisfy safety constraints. Recent advances in policy search algorithms (Mnih et al., 2016, Schulman et al., 2015, Lillicrap et al., 2016, Levine et al., 2016) have enabled new capabilities in high-dimensional control, but do not consider the constrained setting. We propose Constrained Policy Optimization (CPO), the first general-purpose policy search algorithm for constrained reinforcement learning with guarantees for near-constraint satisfaction at each iteration. Our method allows us to train neural network policies for high-dimensional control while making guarantees about policy behavior all throughout training. Our guarantees are based on a new theoretical result, which is of independent interest: we prove a bound relating the expected returns of two policies to an average divergence between them. We demonstrate the effectiveness of our approach on simulated robot locomotion tasks where the agent must satisfy constraints motivated by safety.

[1]  Moshe Haviv,et al.  On constrained Markov decision processes , 1996, Oper. Res. Lett..

[2]  Richard S. Sutton,et al.  Introduction to Reinforcement Learning , 1998 .

[3]  E. Altman Constrained Markov Decision Processes , 1999 .

[4]  Andrew Y. Ng,et al.  Policy Invariance Under Reward Transformations: Theory and Application to Reward Shaping , 1999, ICML.

[5]  John Langford,et al.  Approximately Optimal Approximate Reinforcement Learning , 2002, ICML.

[6]  E. Uchibe,et al.  Constrained reinforcement learning from intrinsic and extrinsic rewards , 2007, 2007 IEEE 6th International Conference on Development and Learning.

[7]  Stephen P. Boyd,et al.  Subgradient Methods , 2007 .

[8]  Stefan Schaal,et al.  2008 Special Issue: Reinforcement learning of motor skills with policy gradients , 2008 .

[9]  Imre Csiszár,et al.  Information Theory - Coding Theorems for Discrete Memoryless Systems, Second Edition , 2011 .

[10]  Pieter Abbeel,et al.  Safe Exploration in Markov Decision Processes , 2012, ICML.

[11]  Daniele Calandriello,et al.  Safe Policy Iteration , 2013, ICML.

[12]  E. Lukács Probability and Mathematical Statistics: An Introduction , 2014 .

[13]  Sergey Levine,et al.  Trust Region Policy Optimization , 2015, ICML.

[14]  Javier García,et al.  A comprehensive survey on safe reinforcement learning , 2015, J. Mach. Learn. Res..

[15]  Shane Legg,et al.  Human-level control through deep reinforcement learning , 2015, Nature.

[16]  Eric Eaton,et al.  Safe Policy Search for Lifelong Reinforcement Learning with Sublinear Regret , 2015, ICML.

[17]  Yuval Tassa,et al.  Continuous control with deep reinforcement learning , 2015, ICLR.

[18]  Pieter Abbeel,et al.  Benchmarking Deep Reinforcement Learning for Continuous Control , 2016, ICML.

[19]  Nan Jiang,et al.  Doubly Robust Off-policy Value Evaluation for Reinforcement Learning , 2015, ICML.

[20]  Alex Graves,et al.  Asynchronous Methods for Deep Reinforcement Learning , 2016, ICML.

[21]  Amnon Shashua,et al.  Safe, Multi-Agent, Reinforcement Learning for Autonomous Driving , 2016, ArXiv.

[22]  Demis Hassabis,et al.  Mastering the game of Go with deep neural networks and tree search , 2016, Nature.

[23]  Sergey Levine,et al.  End-to-End Training of Deep Visuomotor Policies , 2015, J. Mach. Learn. Res..

[24]  Sergey Levine,et al.  High-Dimensional Continuous Control Using Generalized Advantage Estimation , 2015, ICLR.

[25]  Pieter Abbeel,et al.  Probabilistically safe policy transfer , 2017, 2017 IEEE International Conference on Robotics and Automation (ICRA).

[26]  Sergey Levine,et al.  Q-Prop: Sample-Efficient Policy Gradient with An Off-Policy Critic , 2016, ICLR.

[27]  Zachary Chase Lipton,et al.  Combating Deep Reinforcement Learning's Sisyphean Curse with Intrinsic Fear , 2016, 1611.01211.

[28]  Marco Pavone,et al.  Risk-Constrained Reinforcement Learning with Percentile Risk Criteria , 2015, J. Mach. Learn. Res..

[29]  Sen Wang,et al.  Deep Reinforcement Learning for Autonomous Driving , 2018, ArXiv.