WCSAC: Worst-Case Soft Actor Critic for Safety-Constrained Reinforcement Learning

Safe exploration is regarded as a key priority area for reinforcement learning research. With separate reward and safety signals, it is natural to cast safe exploration as constrained reinforcement learning, where the expected long-term cost of a policy is constrained. However, constraining only the expected safety signal, without considering the tail of the cost distribution, can be hazardous; in safety-critical domains, worst-case analysis is needed to avoid disastrous outcomes. We present a novel reinforcement learning algorithm, Worst-Case Soft Actor Critic (WCSAC), which extends the Soft Actor Critic algorithm with a safety critic to achieve risk control. More specifically, the conditional Value-at-Risk (CVaR) of the cost distribution at a chosen risk level serves as the safety measure for assessing constraint satisfaction, and it guides the adaptation of safety weights that trade off reward against safety. As a result, we can optimize policies under the premise that their worst-case performance satisfies the constraints. Our empirical analysis shows that WCSAC attains better risk control than expectation-based methods.
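As a rough sketch of the kind of objective the abstract describes (the notation below is introduced here for illustration and is not taken verbatim from the paper): with cost return $C^{\pi}$, risk level $\alpha$, cost budget $d$, and entropy weight $\beta$, the CVaR-constrained maximum-entropy problem can be written as

    \max_{\pi} \; \mathbb{E}\Big[ \sum_{t} \gamma^{t} \big( r_t + \beta\, \mathcal{H}\big(\pi(\cdot \mid s_t)\big) \big) \Big]
    \quad \text{s.t.} \quad
    \mathrm{CVaR}_{\alpha}(C^{\pi}) \;=\; \mathbb{E}\big[\, C^{\pi} \;\big|\; C^{\pi} \ge \mathrm{VaR}_{\alpha}(C^{\pi}) \,\big] \;\le\; d.

A Lagrangian-style relaxation with an adaptive safety weight $\kappa \ge 0$,

    \mathcal{L}(\pi, \kappa) \;=\; \mathbb{E}\big[\text{reward-plus-entropy return}\big] \;-\; \kappa \,\big( \mathrm{CVaR}_{\alpha}(C^{\pi}) - d \big),

then captures the trade-off the abstract mentions: $\kappa$ is increased while the estimated CVaR exceeds the budget $d$ and decreased once the worst-case constraint is satisfied.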
