Safe Exploration in Finite Markov Decision Processes with Gaussian Processes

In classical reinforcement learning, agents exploring an environment accept arbitrary short-term loss for long-term gain. This is infeasible for safety-critical applications, such as robotics, where even a single unsafe action may cause system failure. In this paper, we address the problem of safely exploring finite Markov decision processes (MDPs). We define safety in terms of an a priori unknown safety constraint that depends on states and actions. We aim to explore the MDP under this constraint, assuming that the unknown function satisfies regularity conditions expressed via a Gaussian process prior. We develop a novel algorithm for this task and prove that it completely explores the safely reachable part of the MDP without violating the safety constraint. To achieve this, it cautiously explores safe states and actions in order to gain statistical confidence about the safety of unvisited state-action pairs from noisy observations collected while navigating the environment. Moreover, the algorithm explicitly considers reachability when exploring the MDP, ensuring that it does not get stuck in any state with no safe way out. We demonstrate our method on digital terrain models for the task of exploring an unknown map with a rover.
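
To make the mechanism concrete, below is a minimal sketch (not the authors' implementation) of the two ingredients the abstract describes: a Gaussian process posterior that turns noisy safety observations into confidence bounds, and a safe-set expansion that certifies only states reachable from already-safe ones. The safety threshold `h`, the confidence-scaling parameter `beta`, the state grid, and the observation data are all illustrative assumptions, and the sketch checks reachability only; the paper's algorithm additionally guarantees a safe way back out of every visited state.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

# Hypothetical setting: states are points on a 1-D grid, and the unknown
# safety function is observed with noise at the states visited so far.
beta = 2.0  # confidence-scaling parameter (assumed, not from the paper)
h = 0.0     # safety threshold: a state counts as safe if safety(s) >= h

states = np.linspace(0.0, 10.0, 101).reshape(-1, 1)

# Noisy safety observations gathered so far (illustrative data).
X_obs = np.array([[4.8], [5.0], [5.2]])
y_obs = np.array([0.9, 1.0, 0.8])

# GP regression model encoding the regularity assumption on the
# safety function; alpha models the observation noise variance.
gp = GaussianProcessRegressor(kernel=RBF(length_scale=1.0), alpha=1e-2)
gp.fit(X_obs, y_obs)

# Lower confidence bound: certify a state as safe only if even the
# pessimistic estimate of its safety value stays above the threshold.
mu, sigma = gp.predict(states, return_std=True)
certified_safe = (mu - beta * sigma) >= h

# Reachability: grow the safe set only through neighbors of states
# already in it, so the agent never crosses uncertified territory.
safe_set = {50}  # index of the known-safe starting state
grew = True
while grew:
    grew = False
    for s in list(safe_set):
        for nb in (s - 1, s + 1):
            if 0 <= nb < len(states) and nb not in safe_set and certified_safe[nb]:
                safe_set.add(nb)
                grew = True

print(f"{len(safe_set)} of {len(states)} states certified safely reachable")
```

Acting only on the lower confidence bound is what makes the exploration cautious: a state joins the safe set only once the observations rule out, with high probability, that its safety value falls below the threshold, and gathering more data near the frontier shrinks the posterior uncertainty there, letting the safe set grow.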
