Conservative and Adaptive Penalty for Model-Based Safe Reinforcement Learning

Reinforcement Learning (RL) agents in the real world must satisfy safety constraints in addition to maximizing a reward objective. Model-based RL algorithms hold promise for reducing unsafe real-world actions: they can synthesize policies that obey all constraints using simulated samples from a learned model. However, imperfect models can result in real-world constraint violations even for actions that are predicted to satisfy all constraints. We propose Conservative and Adaptive Penalty (CAP), a model-based safe RL framework that accounts for potential modeling errors by capturing model uncertainty and adaptively exploiting it to balance the reward and the cost objectives. First, CAP inflates predicted costs using an uncertainty-based penalty. Theoretically, we show that policies satisfying this conservative cost constraint are guaranteed to also be feasible in the true environment, and that this in turn guarantees the safety of all intermediate solutions during RL training. Second, CAP adaptively tunes this penalty during training using true cost feedback from the environment. We evaluate this conservative and adaptive penalty-based approach to model-based safe RL extensively on state-based and image-based environments. Our results demonstrate substantial gains in sample efficiency while incurring fewer violations than prior safe RL algorithms. Code is available at: https://github.com/Redrew/CAP
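To make the two mechanisms described above concrete, here is a minimal Python sketch of an uncertainty-inflated cost and a feedback-driven penalty update. The class and argument names (ConservativeAdaptivePenalty, cost_budget, kappa, learning_rate) are illustrative assumptions for this sketch, not the authors' implementation or API.

```python
class ConservativeAdaptivePenalty:
    """Sketch of CAP's two ideas: (1) inflate model-predicted costs by an
    uncertainty-based penalty, and (2) adapt the penalty weight from true
    cost feedback collected in the real environment."""

    def __init__(self, cost_budget: float, kappa_init: float = 1.0,
                 learning_rate: float = 0.01):
        self.cost_budget = cost_budget      # per-episode cost limit
        self.kappa = kappa_init             # uncertainty penalty weight
        self.learning_rate = learning_rate  # step size for adapting kappa

    def conservative_cost(self, predicted_cost: float, uncertainty: float) -> float:
        # Inflate the learned model's predicted cost so that plans which look
        # safe under the model remain safe despite modeling error.
        return predicted_cost + self.kappa * uncertainty

    def update(self, observed_episode_cost: float) -> None:
        # Adapt kappa using true cost feedback: if the episode exceeded the
        # budget, become more conservative; if it was under budget, relax.
        error = observed_episode_cost - self.cost_budget
        self.kappa = max(0.0, self.kappa + self.learning_rate * error)
```

In a model-based planner (e.g., cross-entropy-method planning over the learned dynamics), candidate action sequences would be ranked by predicted reward subject to the sum of `conservative_cost` values staying within `cost_budget`, and `update` would be called once per real-environment episode.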
