Safe Exploration and Optimization of Constrained MDPs Using Gaussian Processes

We present a reinforcement learning approach for exploring and optimizing a safety-constrained Markov Decision Process (MDP). In this setting, the agent must maximize discounted cumulative reward while constraining the probability of entering unsafe states, where a state is deemed unsafe if its safety-function value lies outside a given tolerance. The safety values of the states are not known a priori, and we model them probabilistically with a Gaussian Process (GP) prior. Acting effectively in this environment therefore requires balancing a three-way trade-off: exploring the safety function, exploring the reward function, and exploiting acquired knowledge to maximize reward. We propose a novel approach to balance this trade-off. Specifically, our approach explores unvisited states selectively; that is, it prioritizes exploring a state when visiting it would significantly improve knowledge of the achievable cumulative reward. Our approach relies on a novel information gain criterion based on Gaussian Process representations of the reward and safety functions. We demonstrate the effectiveness of our approach in a range of experiments, including a simulation using real Martian terrain data.

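To make the GP-based safety modeling described above concrete, the following is a minimal sketch, not the paper's implementation: it fits an exact GP posterior over a 1-D safety function, certifies states as safe when their lower confidence bound exceeds a threshold, and picks the next state to explore by posterior variance, a simple stand-in for the information gain criterion over both reward and safety. All names and parameters (rbf_kernel, safe_lcb, threshold h, scaling beta) are illustrative assumptions, not quantities defined in the paper.

```python
# Minimal sketch (not the paper's algorithm): GP posterior over a 1-D
# safety function, lower-confidence-bound safety check, and a
# variance-based rule for choosing the next state to explore.
import numpy as np

def rbf_kernel(a, b, lengthscale=1.0, variance=1.0):
    """Squared-exponential kernel k(a, b) for 1-D inputs."""
    d = a[:, None] - b[None, :]
    return variance * np.exp(-0.5 * (d / lengthscale) ** 2)

def gp_posterior(x_train, y_train, x_test, noise=1e-3):
    """Exact GP posterior mean and variance at x_test."""
    K = rbf_kernel(x_train, x_train) + noise * np.eye(len(x_train))
    Ks = rbf_kernel(x_train, x_test)
    Kss = rbf_kernel(x_test, x_test)
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y_train))
    mean = Ks.T @ alpha
    v = np.linalg.solve(L, Ks)
    var = np.diag(Kss) - np.sum(v ** 2, axis=0)
    return mean, np.maximum(var, 0.0)

def safe_lcb(mean, var, h=0.0, beta=2.0):
    """Mark a state safe if its GP lower confidence bound
    (mean - beta * std) exceeds the safety threshold h."""
    return mean - beta * np.sqrt(var) > h

# States on a line, with safety values observed at a few of them.
states = np.linspace(0.0, 10.0, 50)
observed = np.array([1.0, 3.0, 5.0])
safety_obs = np.array([0.8, 0.5, -0.2])   # negative => unsafe region nearby

mu, sigma2 = gp_posterior(observed, safety_obs, states)
safe_mask = safe_lcb(mu, sigma2)

# Among states certified safe, explore the one whose safety value is
# most uncertain (largest posterior variance).
candidates = np.where(safe_mask)[0]
next_state = states[candidates[np.argmax(sigma2[candidates])]]
print(f"{safe_mask.sum()} states certified safe; explore state {next_state:.2f}")
```

The lower-confidence-bound check reflects the usual conservative treatment of safety under GP uncertainty; the paper's actual criterion additionally trades off information about the reward function, which this sketch omits.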