Safe Policies for Reinforcement Learning via Primal-Dual Methods

In this paper, we study the learning of safe policies in the setting of reinforcement learning. That is, we aim to control a Markov Decision Process (MDP) whose transition probabilities are unknown, but for which we have access to sample trajectories gathered through experience. We define safety as the agent remaining within a desired safe set with high probability during the operation time. We therefore consider a constrained MDP in which the constraints are probabilistic. Since there is no straightforward way to optimize the policy with respect to the probabilistic constraint in a reinforcement learning framework, we propose an ergodic relaxation of the problem. The advantages of the proposed relaxation are threefold. (i) The safety guarantees are maintained in the case of episodic tasks, and they hold up to a given time horizon for continuing tasks. (ii) Despite its non-convexity, the constrained optimization problem has an arbitrarily small duality gap if the policy parametrization is rich enough. (iii) The gradients of the Lagrangian associated with the safe-learning problem can be easily computed using standard policy gradient results and stochastic approximation tools. Leveraging these advantages, we establish that primal-dual algorithms are able to find policies that are both safe and optimal. We test the proposed approach on a navigation task in a continuous domain. The numerical results show that our algorithm is capable of dynamically adapting the policy to the environment and to the required safety levels.
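
To make the primal-dual idea concrete, the sketch below shows one stochastic update for a relaxed problem of the form: maximize the expected return V_r(theta) subject to the (discounted) safe-set occupancy V_safe(theta) exceeding a required level. It is a minimal illustration, not the authors' implementation: the environment interface, the policy object (policy.sample_with_grad), the "in_safe_set" flag, the safety_level value, and the step sizes are all illustrative assumptions, and the gradient estimator is a plain REINFORCE score-function estimator rather than the paper's exact construction.

import numpy as np

def sample_trajectory(env, policy, theta, horizon):
    """Roll out one episode; return rewards, safe-set indicators, and score-function terms."""
    s = env.reset()
    rewards, safe_flags, grads = [], [], []
    for _ in range(horizon):
        # Hypothetical policy API: sample an action and return grad_theta log pi_theta(a|s).
        a, grad_logp = policy.sample_with_grad(theta, s)
        s, r, done, info = env.step(a)
        rewards.append(r)
        # Assumed environment signal indicating whether the state is in the safe set.
        safe_flags.append(float(info.get("in_safe_set", True)))
        grads.append(grad_logp)
        if done:
            break
    return rewards, safe_flags, grads

def primal_dual_update(theta, lam, env, policy, safety_level,
                       gamma=0.99, eta_theta=1e-3, eta_lam=1e-2, horizon=200):
    """One stochastic primal-dual step on the Lagrangian
       L(theta, lam) = V_r(theta) + lam * (V_safe(theta) - safety_level)."""
    rewards, safe_flags, grads = sample_trajectory(env, policy, theta, horizon)

    # Discounted return for the reward and for the safe-set indicator.
    ret_r, ret_s, disc = 0.0, 0.0, 1.0
    score = np.zeros_like(theta)
    for r, c, g in zip(rewards, safe_flags, grads):
        ret_r += disc * r
        ret_s += disc * c
        score += g
        disc *= gamma

    # Crude REINFORCE estimator of grad_theta L: total Lagrangian return times summed score.
    # Unbiased but high variance; a baseline or per-step returns would reduce variance.
    grad_theta = (ret_r + lam * ret_s) * score

    # Primal ascent on theta, projected dual descent on lam (lam must stay nonnegative).
    theta = theta + eta_theta * grad_theta
    lam = max(0.0, lam - eta_lam * (ret_s - safety_level))
    return theta, lam

In use, theta and lam would be iterated jointly, e.g. theta, lam = primal_dual_update(theta, lam, env, policy, safety_level=20.0): the multiplier lam grows whenever the sampled safe-set occupancy falls below the required level, penalizing unsafe behavior in the primal step, and shrinks toward zero once the constraint is slack.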
