Efficient and Safe Exploration in Deterministic Markov Decision Processes with Unknown Transition Models

We propose a safe exploration algorithm for deterministic Markov Decision Processes with unknown transition models. The algorithm guarantees safety by exploiting Lipschitz continuity to ensure that no unsafe state is visited during exploration. Unlike many existing techniques, the safety guarantee is deterministic rather than probabilistic. The algorithm is further optimized to reduce the number of actions needed to explore the safe state space. We compare its performance against baseline methods in simulated navigation tasks.
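
To illustrate the kind of Lipschitz-based safety certificate the abstract refers to, here is a minimal sketch (not the paper's algorithm): if a safety function h is L-Lipschitz and h(s) >= 0 denotes a safe state, then an observed value at an explored state bounds the worst-case value at nearby unexplored states. The names `certified_safe`, `h_values`, and `threshold` are illustrative assumptions, not identifiers from the paper.

```python
import numpy as np

def certified_safe(candidate, explored_states, h_values, lipschitz_const, threshold=0.0):
    """Return True if some explored state certifies `candidate` as safe.

    A state s with observed safety value h(s) certifies a candidate s' when
    h(s) - L * ||s - s'|| >= threshold, i.e. the worst-case decrease of the
    L-Lipschitz safety function over the distance to s' keeps it above the
    safety threshold.
    """
    candidate = np.asarray(candidate, dtype=float)
    for s, h in zip(explored_states, h_values):
        if h - lipschitz_const * np.linalg.norm(candidate - np.asarray(s, dtype=float)) >= threshold:
            return True
    return False

# Example: with L = 1, a state 0.3 away from a state with safety value 0.5 is certified safe.
print(certified_safe((1.3, 0.0), [(1.0, 0.0)], [0.5], lipschitz_const=1.0))  # True
```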
