Safe Reinforcement Learning of Control-Affine Systems with Vertex Networks

This paper focuses on finding reinforcement learning policies for control systems with hard state and action constraints. Despite its success in many domains, reinforcement learning is challenging to apply to problems with hard constraints, especially when both the state variables and the actions are constrained. Previous works seeking to ensure constraint satisfaction, or safety, have focused on adding a projection step to a learned policy. However, this approach requires solving an optimization problem at every policy execution step, which can incur significant computational cost. To address this issue, this paper proposes a new approach, termed Vertex Networks (VNs), which guarantees safety both during exploration and for the learned control policies by incorporating the safety constraints directly into the policy network architecture. Leveraging the geometric fact that every point in a convex set can be written as a convex combination of its vertices, the proposed algorithm first learns the convex combination weights and then combines them with the pre-calculated vertices to output an action. The resulting action is safe by construction. Numerical examples illustrate that the proposed VN algorithm outperforms vanilla reinforcement learning on a variety of benchmark control tasks.
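
To make the construction concrete, the following is a minimal sketch (in PyTorch, with all names, shapes, and the network itself hypothetical rather than taken from the paper): a small network maps the state to softmax weights over pre-computed vertices of the safe convex set, and the action is the corresponding convex combination, so it lies inside the safe set by construction. In practice the vertices would be pre-calculated from the state and action constraints as described in the paper; here they are passed in as a fixed tensor purely for illustration.

import torch
import torch.nn as nn

class VertexPolicy(nn.Module):
    """Illustrative Vertex Network policy head (not the authors' code).

    `vertices` is an (n_vertices, action_dim) tensor holding pre-computed
    vertices of the safe convex action polytope.
    """

    def __init__(self, state_dim: int, vertices: torch.Tensor, hidden: int = 64):
        super().__init__()
        self.register_buffer("vertices", vertices)  # (n_vertices, action_dim)
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, vertices.shape[0]),  # one logit per vertex
        )

    def forward(self, state: torch.Tensor) -> torch.Tensor:
        # Softmax gives non-negative weights summing to one, i.e. valid
        # convex-combination coefficients over the vertices.
        weights = torch.softmax(self.net(state), dim=-1)  # (batch, n_vertices)
        # Convex combination of vertices: the action cannot leave the polytope.
        return weights @ self.vertices                     # (batch, action_dim)

# Example usage: a 2-D action box [-1, 1]^2 has four vertices.
vertices = torch.tensor([[-1., -1.], [-1., 1.], [1., -1.], [1., 1.]])
policy = VertexPolicy(state_dim=3, vertices=vertices)
action = policy(torch.randn(5, 3))  # shape (5, 2), every row inside the box

This sketch uses a fixed vertex set for simplicity; the paper's control-affine setting with state- and action-dependent safe sets would supply the appropriate pre-calculated vertices at each step, but the convex-combination output layer is the same idea.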
