Robust Exploration/Exploitation Trade-Offs in Safety-Critical Applications

Abstract For future service robots, unsafe exceptional circumstances that are hard to foresee can occur in complex systems. In this paper, the assumption of having no prior knowledge about the environment is investigated, using reinforcement learning as an option for learning behavior by trial and error. In such a scenario, action-selection decisions are based on predictions of future reward, with the aim of minimizing the cost of reaching a goal. It is shown that the selection of safety-critical actions, i.e. actions incurring highly negative rewards (high costs) from the environment, is directly related to the exploration/exploitation dilemma in temporal-difference learning. To this end, several exploration policies are compared with regard to worst- and best-case performance in a dynamic environment. Our results show that, in contrast to established exploration policies such as ε-greedy and Softmax, the recently proposed VDBE-Softmax policy appears more appropriate for such applications because its exploration parameter is robust to unexpected situations.
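Since the argument turns on how VDBE-Softmax adapts its exploration parameter from observed value differences, a minimal sketch may help illustrate the mechanism. It assumes a tabular Q-learning setting; the class and parameter names (VDBESoftmaxPolicy, sigma, tau, update_epsilon) are illustrative choices, not taken from the paper, and the formulas follow the general scheme of value-difference based exploration (Tokic, 2010; Tokic and Palm, 2011).

```python
import numpy as np


class VDBESoftmaxPolicy:
    """Sketch of a VDBE-Softmax-style exploration policy.

    A state-dependent exploration rate epsilon(s) is adapted from the
    magnitude of the change of the updated Q-value: large value differences
    (surprising outcomes) raise epsilon(s), while small differences let it
    decay toward exploitation.
    """

    def __init__(self, n_states, n_actions, sigma=1.0, tau=1.0, eps_init=1.0):
        self.n_actions = n_actions
        self.sigma = sigma                        # inverse sensitivity to value changes (assumed)
        self.tau = tau                            # softmax temperature used when exploring (assumed)
        self.eps = np.full(n_states, eps_init)    # state-dependent exploration rate

    def select_action(self, q_values, state, rng=np.random):
        """With probability eps(s) draw from a softmax over Q-values, else act greedily."""
        if rng.random() < self.eps[state]:
            prefs = q_values[state] / self.tau
            prefs -= prefs.max()                  # numerical stability
            probs = np.exp(prefs) / np.exp(prefs).sum()
            return int(rng.choice(self.n_actions, p=probs))
        return int(np.argmax(q_values[state]))

    def update_epsilon(self, state, value_difference):
        """Adapt eps(s) from the absolute change of the Q-value after a TD update."""
        x = np.exp(-abs(value_difference) / self.sigma)
        f = (1.0 - x) / (1.0 + x)                 # in [0, 1): large change -> more exploration
        mix = 1.0 / self.n_actions                # mixing rate (a common choice in the VDBE scheme)
        self.eps[state] = mix * f + (1.0 - mix) * self.eps[state]
```

In a plain Q-learning loop, one would call update_epsilon(s, Q_new - Q_old) right after the temporal-difference update of Q(s, a), so that states with stable value estimates are exploited while states showing large value changes keep being explored.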
