Safe Model-based Reinforcement Learning with Stability Guarantees

Reinforcement learning is a powerful paradigm for learning optimal policies from experimental data. However, to find optimal policies, most reinforcement learning algorithms explore all possible actions, which may be harmful to real-world systems. As a consequence, learning algorithms are rarely applied to safety-critical systems in the real world. In this paper, we present a learning algorithm that explicitly considers safety, defined in terms of stability guarantees. Specifically, we extend control-theoretic results on Lyapunov stability verification and show how to use statistical models of the dynamics to obtain high-performance control policies with provable stability certificates. Moreover, under additional regularity assumptions in the form of a Gaussian process prior, we prove that one can effectively and safely collect data in order to learn about the dynamics, and thus both improve control performance and expand the safe region of the state space. In our experiments, we show how the resulting algorithm can safely optimize a neural network policy on a simulated inverted pendulum, without the pendulum ever falling down.

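To make the core idea concrete, the following is a minimal, hedged sketch (Python/NumPy, not the authors' released implementation) of Lyapunov-based safe-set estimation with an uncertain dynamics model: a discretized state is certified safe only if the upper confidence bound on the Lyapunov function at the predicted next state still decreases. The quadratic Lyapunov candidate, the linear placeholder model dynamics_mean_std, the linear feedback policy, and the Lipschitz constant are illustrative assumptions; the paper itself uses a Gaussian process posterior over the dynamics and a discretization-based verification argument.

# Hedged sketch (not the authors' code): Lyapunov-based safe-set estimation on a
# discretized state space, assuming a learned dynamics model that returns a mean
# prediction and a confidence bound (e.g., from a Gaussian process posterior).
import numpy as np

def lyapunov(x, P):
    """Quadratic Lyapunov candidate V(x) = x^T P x, evaluated row-wise."""
    return np.einsum('ni,ij,nj->n', x, P, x)

def dynamics_mean_std(x, u):
    """Placeholder statistical model of the next state.

    In the paper's setting this would be a GP posterior over the dynamics;
    here a crude linear model with a constant uncertainty estimate is used
    purely to make the sketch runnable. All constants are assumptions.
    """
    A = np.array([[1.0, 0.05], [0.1, 1.0]])
    B = np.array([[0.0], [0.05]])
    mean = x @ A.T + u @ B.T
    std = 0.01 * np.ones_like(mean)
    return mean, std

def estimate_safe_set(states, policy, P, beta=2.0, lipschitz_v=1.0, margin=1e-3):
    """Mark states whose worst-case predicted next state still decreases V.

    A state is certified safe if the upper confidence bound on
    V(f(x, pi(x))) lies below V(x) - margin, which (under Lipschitz
    continuity assumptions) implies the Lyapunov decrease condition.
    """
    u = policy(states)
    mean_next, std_next = dynamics_mean_std(states, u)
    v_now = lyapunov(states, P)
    # Upper bound on V at the next state: V at the mean prediction plus a
    # Lipschitz term covering the beta-sigma confidence interval of the model.
    v_next_ucb = lyapunov(mean_next, P) + lipschitz_v * beta * np.linalg.norm(std_next, axis=1)
    return v_next_ucb < v_now - margin

if __name__ == "__main__":
    # Discretize a small region around the upright equilibrium of a pendulum-like system.
    grid = np.array([[a, b] for a in np.linspace(-0.5, 0.5, 21)
                            for b in np.linspace(-0.5, 0.5, 21)])
    P = np.eye(2)
    policy = lambda x: -x @ np.array([[0.5], [0.5]])  # simple linear feedback
    safe = estimate_safe_set(grid, policy, P)
    print(f"{safe.sum()} of {len(grid)} grid states certified safe")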