WCSAC: Worst-Case Soft Actor Critic for Safety-Constrained Reinforcement Learning

Safe exploration is regarded as a key priority area for reinforcement learning research. With separate reward and safety signals, it is natural to cast safe exploration as constrained reinforcement learning, where the expected long-term cost of a policy is constrained. However, constraining only the expected safety signal, without considering the tail of the cost distribution, can be hazardous; in safety-critical domains, worst-case analysis is needed to avoid disastrous outcomes. We present a novel reinforcement learning algorithm, Worst-Case Soft Actor Critic (WCSAC), which extends the Soft Actor Critic algorithm with a safety critic to achieve risk control. More specifically, the conditional Value-at-Risk (CVaR) of the cost distribution at a chosen risk level serves as the safety measure for assessing constraint satisfaction, and it guides the adaptation of safety weights that trade off reward against safety. As a result, we can optimize policies under the premise that their worst-case performance satisfies the constraints. Our empirical analysis shows that WCSAC attains better risk control than expectation-based methods.
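As a rough sketch of the kind of objective the abstract describes (the notation below is introduced here for illustration and is not taken verbatim from the paper): with cost return $C^{\pi}$, risk level $\alpha$, cost budget $d$, and entropy weight $\beta$, the CVaR-constrained maximum-entropy problem can be written as

    \max_{\pi} \; \mathbb{E}\Big[ \sum_{t} \gamma^{t} \big( r_t + \beta\, \mathcal{H}\big(\pi(\cdot \mid s_t)\big) \big) \Big]
    \quad \text{s.t.} \quad
    \mathrm{CVaR}_{\alpha}(C^{\pi}) \;=\; \mathbb{E}\big[\, C^{\pi} \;\big|\; C^{\pi} \ge \mathrm{VaR}_{\alpha}(C^{\pi}) \,\big] \;\le\; d.

A Lagrangian-style relaxation with an adaptive safety weight $\kappa \ge 0$,

    \mathcal{L}(\pi, \kappa) \;=\; \mathbb{E}\big[\text{reward-plus-entropy return}\big] \;-\; \kappa \,\big( \mathrm{CVaR}_{\alpha}(C^{\pi}) - d \big),

then captures the trade-off the abstract mentions: $\kappa$ is increased while the estimated CVaR exceeds the budget $d$ and decreased once the worst-case constraint is satisfied.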
