Safe Exploration by Solving Early Terminated MDP

Safe exploration is crucial for the real-world application of reinforcement learning (RL). Previous work formulates safe exploration as a Constrained Markov Decision Process (CMDP), in which policies are optimized under cost constraints. However, when encountering potential danger, humans tend to stop immediately and rarely learn to behave safely within the dangerous situation. Motivated by human learning, we introduce a new approach to safe RL under the framework of the Early Terminated MDP (ET-MDP). We first define the ET-MDP as an unconstrained MDP that shares the same optimal value function as its corresponding CMDP. An off-policy algorithm based on context models is then proposed to solve the ET-MDP, and thereby the corresponding CMDP, with better asymptotic performance and improved learning efficiency. Experiments on a variety of CMDP tasks show substantial improvements over previous methods that solve the CMDP directly.
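
As a rough illustration of the early-termination idea only (not the paper's full off-policy, context-model algorithm), the sketch below wraps a Gym-style CMDP environment and ends an episode as soon as the accumulated safety cost exceeds a constraint budget, so that an ordinary unconstrained RL algorithm can be trained on the wrapped task. The `"cost"` key in `info`, the wrapper name, and the threshold are illustrative assumptions, not details taken from the paper.

```python
# Minimal sketch of an early-termination wrapper, assuming a classic
# Gym API (4-tuple step) and a per-step safety cost in info["cost"].
import gym


class EarlyTerminatedWrapper(gym.Wrapper):
    def __init__(self, env, cost_threshold: float):
        super().__init__(env)
        self.cost_threshold = cost_threshold  # constraint budget d
        self._episode_cost = 0.0

    def reset(self, **kwargs):
        self._episode_cost = 0.0
        return self.env.reset(**kwargs)

    def step(self, action):
        obs, reward, done, info = self.env.step(action)
        # Accumulate the safety cost reported by the underlying CMDP task
        # (hypothetical key; real benchmarks may expose it differently).
        self._episode_cost += info.get("cost", 0.0)
        if self._episode_cost > self.cost_threshold:
            # Stop immediately once the budget is exceeded, mimicking how a
            # human halts when encountering danger instead of exploring on.
            done = True
            info["early_terminated"] = True
        return obs, reward, done, info
```

In use, one would construct the wrapper around a constrained task, e.g. `EarlyTerminatedWrapper(env, cost_threshold=25.0)`, and then run any standard unconstrained learner on it; the paper's method additionally conditions the policy on a learned context to handle the altered termination dynamics.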
