Safe Exploration in Markov Decision Processes

In environments with uncertain dynamics, exploration is necessary to learn how to perform well. Existing reinforcement learning algorithms provide strong exploration guarantees, but they tend to rely on an ergodicity assumption: any state is eventually reachable from any other state by following a suitable policy. This assumption allows exploration algorithms that operate by simply favoring states that have rarely been visited before. For most physical systems, however, the assumption is impractical: the system would break before any reasonable exploration has taken place, i.e., most physical systems are not ergodic. In this paper we address the need for safe exploration methods in Markov decision processes. We first propose a general formulation of safety through ergodicity. We show that imposing safety by restricting attention to the resulting set of guaranteed safe policies is NP-hard. We then present an efficient algorithm for guaranteed safe, but potentially suboptimal, exploration. At its core is an optimization formulation in which the constraints restrict attention to a subset of the guaranteed safe policies and the objective favors exploration policies. Our framework is compatible with the majority of previously proposed exploration methods, which rely on an exploration bonus. Our experiments, which include a Martian terrain exploration problem, show that our method explores better than classical exploration methods.
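To make the idea of "safety constraints plus an exploration objective" concrete, here is a minimal sketch, not the paper's exact formulation. It assumes a small, fully known chain MDP with a designated home state and an absorbing "broken" state; safety is approximated by requiring that any action taken keeps the probability of being able to return home above a threshold delta, and among safe actions the agent greedily prefers rarely visited successors (a simple count-based exploration bonus). All state names, the threshold, and the greedy selection rule are illustrative assumptions.

```python
# Hypothetical sketch of safe exploration: restrict to actions that preserve
# recoverability (probability of returning to a home state >= delta), and
# among those favor an exploration bonus. Not the paper's exact algorithm.
import numpy as np

n_states, n_actions = 6, 2
HOME, BROKEN = 0, 5           # state 5 is absorbing and unrecoverable
delta, horizon = 0.9, 20

# P[a, s, s']: transition model, assumed known here for illustration.
P = np.zeros((n_actions, n_states, n_states))
for s in range(n_states):
    if s == BROKEN:
        P[:, s, s] = 1.0                          # broken state is absorbing
    else:
        P[0, s, max(s - 1, 0)] = 1.0              # action 0: step back toward home
        P[1, s, min(s + 1, n_states - 1)] = 0.9   # action 1: step forward...
        P[1, s, s] = 0.1                          # ...but may slip in place

def return_prob(P, home, horizon):
    """Max probability of reaching `home` within `horizon` steps (dynamic program)."""
    v = np.zeros(n_states)
    v[home] = 1.0
    for _ in range(horizon):
        q = P @ v                      # q[a, s] = P(reach home | s, a, then act greedily)
        v = np.maximum(v, q.max(axis=0))
        v[home] = 1.0
    return v

v_safe = return_prob(P, HOME, horizon)

def safe_exploring_action(s, visit_counts):
    """Among actions whose successors stay recoverable w.p. >= delta, prefer novelty."""
    candidates = []
    for a in range(n_actions):
        if P[a, s] @ v_safe >= delta:              # safety constraint
            bonus = -(P[a, s] @ visit_counts)      # exploration bonus: expected novelty
            candidates.append((bonus, a))
    if not candidates:
        return 0                                   # fall back to the "return home" action
    return max(candidates)[1]

counts = np.zeros(n_states)
print(safe_exploring_action(2, counts))  # forward is still recoverable here -> explores
print(safe_exploring_action(4, counts))  # forward risks breaking -> constraint forces back
```

In this toy version the safety constraint is checked one action at a time against a precomputed return-probability table; the paper instead optimizes over policies directly, which is what makes the exact problem NP-hard and motivates the tractable restriction described in the abstract.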
