Learning Safe Policies via Primal-Dual Methods

In this paper, we study the learning of safe policies in the reinforcement learning setting. That is, we aim to control a Markov Decision Process (MDP) whose transition probabilities are unknown, but for which we have access to sample trajectories obtained through experiments. We define safety as the agent remaining inside a desired safe set with high probability at every time instant. We therefore consider a constrained MDP in which the constraints are probabilistic. Because such chance constraints are difficult to handle directly in a reinforcement learning framework, we propose an ergodic relaxation of the problem. This relaxation is nonetheless such that we can provide safety guarantees for the resulting policies. To compute these policies, we resort to a stochastic primal-dual method. We test the proposed approach on a navigation task in a grid world. The numerical results show that our algorithm is capable of dynamically adapting the policy to the environment and to the required safety levels.
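To make the primal-dual idea concrete, below is a minimal sketch, not the paper's implementation: a REINFORCE-style stochastic primal-dual update on a small grid world, where the chance constraint "stay in the safe set at every time instant" is replaced by an assumed discounted safety budget standing in for the ergodic relaxation. The grid layout, the unsafe cells, the budget (1-δ)/(1-γ), the step sizes, and all names in the code are illustrative assumptions.

```python
"""
Illustrative sketch only (not the paper's implementation): stochastic
primal-dual policy gradient on a small grid world, with the chance
constraint replaced by an assumed discounted safety budget.
"""
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical 5x5 grid: start at cell 0, reward at the opposite corner,
# and a few cells designated as unsafe (all of this is assumed).
N = 5
GOAL = N * N - 1
UNSAFE = {7, 12, 17}
ACTIONS = [(-1, 0), (1, 0), (0, -1), (0, 1)]    # up, down, left, right
GAMMA, DELTA = 0.95, 0.1                        # discount, tolerated unsafety
ALPHA_THETA, ALPHA_LAMBDA = 0.05, 0.02          # primal / dual step sizes
BUDGET = (1.0 - DELTA) / (1.0 - GAMMA)          # relaxed safety requirement

def step(state, action):
    """Deterministic grid dynamics; walls clip the move."""
    r, c = divmod(state, N)
    dr, dc = ACTIONS[action]
    r = min(max(r + dr, 0), N - 1)
    c = min(max(c + dc, 0), N - 1)
    nxt = r * N + c
    reward = 1.0 if nxt == GOAL else 0.0
    safe = 0.0 if nxt in UNSAFE else 1.0        # indicator of the safe set
    return nxt, reward, safe

def policy(theta, state):
    """Tabular softmax policy."""
    logits = theta[state] - theta[state].max()
    p = np.exp(logits)
    return p / p.sum()

def rollout(theta, horizon=60):
    """Sample one trajectory and its discounted reward / safety returns."""
    s, traj, ret_r, ret_c = 0, [], 0.0, 0.0
    for t in range(horizon):
        p = policy(theta, s)
        a = rng.choice(4, p=p)
        s_next, reward, safe = step(s, a)
        traj.append((s, a))
        ret_r += GAMMA ** t * reward
        ret_c += GAMMA ** t * safe
        s = s_next
    return traj, ret_r, ret_c

theta, lam = np.zeros((N * N, 4)), 0.0
for _ in range(5000):
    traj, ret_r, ret_c = rollout(theta)
    # Crude REINFORCE estimate of the gradient of the Lagrangian
    #   L(theta, lam) = V_r(theta) + lam * (V_safe(theta) - BUDGET).
    weight = ret_r + lam * ret_c
    for s, a in traj:
        p = policy(theta, s)
        grad_log = -p
        grad_log[a] += 1.0
        theta[s] += ALPHA_THETA * weight * grad_log   # primal (ascent) step
    # Dual (descent) step: raise lam when the sampled safety return falls
    # short of the budget, then project back onto lam >= 0.
    lam = max(0.0, lam - ALPHA_LAMBDA * (ret_c - BUDGET))

print("final dual variable:", lam)
```

The dual variable grows whenever the sampled discounted safety return falls below the assumed budget, which in turn tilts the policy-gradient step toward safer trajectories; this alternation of primal ascent and dual descent mirrors the stochastic primal-dual structure described above, under the simplifications noted.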
