Safe Exploration and Optimization of Constrained MDPs Using Gaussian Processes
Yisong Yue | Masahiro Ono | Yanan Sui | Akifumi Wachi
[1] Michael Kearns, et al. Near-Optimal Reinforcement Learning in Polynomial Time, 2002, Machine Learning.
[2] S. Ghosal, et al. Posterior consistency of Gaussian process prior for nonparametric binary regression, 2006, arXiv:math/0702686.
[3] A. McEwen, et al. Mars Reconnaissance Orbiter's High Resolution Imaging Science Experiment (HiRISE), 2007.
[4] Marco Pavone, et al. Chance-constrained dynamic programming with application to risk-aware robotic space exploration, 2015, Autonomous Robots.
[5] Pieter Abbeel, et al. Safe Exploration in Markov Decision Processes, 2012, ICML.
[6] Olivier Buffet, et al. Near-Optimal BRL using Optimistic Local Transitions, 2012, ICML.
[7] Alkis Gotovos, et al. Safe Exploration for Optimization with Gaussian Processes, 2015, ICML.
[8] Malcolm J. A. Strens, et al. A Bayesian Framework for Reinforcement Learning, 2000, ICML.
[9] Ronen I. Brafman, et al. R-MAX - A General Polynomial Time Algorithm for Near-Optimal Reinforcement Learning, 2001, J. Mach. Learn. Res.
[10] W. Fleming, et al. Risk-Sensitive Control on an Infinite Time Horizon, 1995.
[11] J. Mockus. Bayesian Approach to Global Optimization: Theory and Applications, 1989.
[12] Carl E. Rasmussen, et al. Gaussian Processes for Machine Learning, 2005, Adaptive Computation and Machine Learning.
[13] John N. Tsitsiklis, et al. Neuro-dynamic programming: an overview, 1995, Proceedings of the 34th IEEE Conference on Decision and Control.
[14] Andreas Krause, et al. Information-Theoretic Regret Bounds for Gaussian Process Optimization in the Bandit Setting, 2009, IEEE Transactions on Information Theory.
[15] Richard S. Sutton, et al. Reinforcement Learning: An Introduction, 1998, IEEE Trans. Neural Networks.
[16] Andrew Y. Ng, et al. Near-Bayesian exploration in polynomial time, 2009, ICML.
[17] Lihong Li, et al. PAC model-free reinforcement learning, 2006, ICML.
[18] David Q. Mayne, et al. Constrained model predictive control: Stability and optimality, 2000, Automatica.
[19] Sham M. Kakade, et al. On the sample complexity of reinforcement learning, 2003.
[20] Andreas Krause, et al. Safe Exploration in Finite Markov Decision Processes with Gaussian Processes, 2016, NIPS.
[21] Michael Nikolaou, et al. Chance-constrained model predictive control, 1999.
[22] Masahiro Ono, et al. A Probabilistic Particle-Control Approximation of Chance-Constrained Stochastic Predictive Control, 2010, IEEE Transactions on Robotics.