A Lyapunov-based Approach to Safe Reinforcement Learning

In many real-world reinforcement learning (RL) problems, an agent must not only optimize the main objective function but also concurrently avoid violating a number of constraints. In particular, besides optimizing performance, it is crucial to guarantee the safety of an agent during training as well as deployment (e.g., a robot should avoid taking actions, exploratory or not, that irrevocably harm its hardware). To incorporate safety in RL, we derive algorithms under the framework of constrained Markov decision processes (CMDPs), an extension of standard Markov decision processes (MDPs) augmented with constraints on expected cumulative costs. Our approach hinges on a novel Lyapunov method. We define and present a method for constructing Lyapunov functions, which provide an effective way to guarantee the global safety of a behavior policy during training via a set of local, linear constraints. Leveraging these theoretical underpinnings, we show how to use the Lyapunov approach to systematically transform dynamic programming (DP) and RL algorithms into their safe counterparts. To illustrate their effectiveness, we evaluate these algorithms in several CMDP planning and decision-making tasks on a safety benchmark domain. Our results show that the proposed method significantly outperforms existing baselines in balancing constraint satisfaction and performance.
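To make the constrained formulation and the Lyapunov idea concrete, the sketch below writes them in generic notation. This is an illustrative sketch only, not necessarily the paper's exact formulation: here r and c denote the reward and constraint cost, d_0 the constraint threshold, x_0 the initial state, pi_B the current behavior policy, and T a generic horizon; the paper's treatment of horizons, stopping times, and discounting may differ.

\[
  \max_{\pi}\ \mathbb{E}_{\pi}\Big[\sum_{t=0}^{T} r(x_t, a_t) \,\Big|\, x_0\Big]
  \quad \text{s.t.} \quad
  \mathbb{E}_{\pi}\Big[\sum_{t=0}^{T} c(x_t, a_t) \,\Big|\, x_0\Big] \le d_0,
\]
\[
  % A Lyapunov-style condition: L dominates its one-step cost-plus-expectation
  % backup under the behavior policy and is feasible at the initial state.
  L(x) \ \ge\ c\bigl(x, \pi_B(x)\bigr) + \sum_{x'} P\bigl(x' \mid x, \pi_B(x)\bigr)\, L(x')
  \ \ \text{for all } x, \qquad L(x_0) \le d_0,
\]
\[
  % The induced set of policies satisfying one linear inequality per state.
  \mathcal{F}_L(x) \;=\; \Big\{\pi(\cdot \mid x) \,:\, \sum_{a} \pi(a \mid x)\Big[c(x,a) + \sum_{x'} P(x' \mid x, a)\, L(x')\Big] \le L(x)\Big\}.
\]

Intuitively, a function L of this kind upper-bounds the constraint cost-to-go, so checking the linear inequality state by state suffices to certify the global cumulative-cost constraint; restricting DP and RL policy updates to the L-induced set \mathcal{F}_L is what yields the "safe counterparts" referred to above.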
