Improving Safety in Deep Reinforcement Learning using Unsupervised Action Planning

One of the key challenges in deep reinforcement learning (deep RL) is to ensure safety during both the training and testing phases. In this work, we propose a novel technique of unsupervised action planning to improve the safety of on-policy reinforcement learning algorithms, such as trust region policy optimization (TRPO) and proximal policy optimization (PPO). We design our safety-aware reinforcement learning algorithm by storing the history of "recovery" actions that rescue the agent from dangerous situations in a separate "safety" buffer and retrieving the best recovery action when the agent encounters a similar state. Because this functionality requires the algorithm to query similar states, we implement the proposed safety mechanism using an unsupervised learning algorithm, k-means clustering. We evaluate the proposed algorithm on six robotic control tasks covering navigation and manipulation. Our results show that the proposed safe RL algorithm achieves higher rewards than multiple baselines in both discrete and continuous control problems. The supplemental video can be found at: https://youtu.be/AFTeWSohILo.
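To make the mechanism concrete, below is a minimal sketch of how a safety buffer with a k-means lookup could be implemented. It is not the paper's implementation: the SafetyBuffer class, its method names, and the max_dist threshold are hypothetical, and scikit-learn's KMeans stands in for the unsupervised clustering step described in the abstract.

```python
# Minimal sketch of the safety-buffer idea from the abstract (illustrative,
# not the authors' code). Assumes states are fixed-length float vectors.
import numpy as np
from sklearn.cluster import KMeans


class SafetyBuffer:
    def __init__(self, n_clusters=32):
        self.states = []            # states in which a recovery succeeded
        self.recovery_actions = []  # the action that rescued the agent
        self.kmeans = None
        self.n_clusters = n_clusters

    def add(self, state, recovery_action):
        """Store a (state, recovery action) pair observed during training."""
        self.states.append(np.asarray(state, dtype=np.float32))
        self.recovery_actions.append(np.asarray(recovery_action, dtype=np.float32))

    def fit(self):
        """Cluster stored states so similar situations can be queried quickly."""
        X = np.stack(self.states)
        k = min(self.n_clusters, len(self.states))
        self.kmeans = KMeans(n_clusters=k, n_init=10).fit(X)

    def query(self, state, max_dist=1.0):
        """Return the recovery action of the nearest similar state, if any."""
        if self.kmeans is None:
            return None
        state = np.asarray(state, dtype=np.float32)
        cluster = self.kmeans.predict(state[None, :])[0]
        members = np.where(self.kmeans.labels_ == cluster)[0]
        dists = np.array([np.linalg.norm(self.states[i] - state) for i in members])
        best = members[int(np.argmin(dists))]
        if dists.min() > max_dist:
            return None  # no sufficiently similar state has been recorded
        return self.recovery_actions[best]


# Hypothetical usage: during training, whenever a recovery action leads the
# agent out of a dangerous state, call buffer.add(state, action); periodically
# call buffer.fit(); at decision time, prefer buffer.query(state) when it is
# not None, otherwise fall back to the on-policy action from TRPO/PPO.
```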
