Dead-ends and Secure Exploration in Reinforcement Learning

Many interesting applications of reinforcement learning (RL) involve MDPs that contain many "dead-end" states. Upon reaching a dead-end state, the agent continues to interact with the environment along a dead-end trajectory before reaching a terminal state, but it can no longer collect any positive reward, regardless of the actions it chooses. The situation is even worse when the existence of many dead-end states is coupled with positive rewards that lie far from every initial state; we call this the bridge effect. Conventional exploration techniques then often require a prohibitively large number of training steps before convergence. To deal with the bridge effect, we propose a condition for exploration, called security. We then establish formal results that translate the security condition into the learning problem of an auxiliary value function. This new value function is used to cap any given exploration policy and is guaranteed to make it secure. As a special case, we use this theory to introduce the secure random-walk. We then extend our results to the deep RL setting by identifying and addressing two main challenges that arise there. Finally, we empirically compare the secure random-walk against standard benchmarks in two sets of experiments, including the Atari game of Montezuma's Revenge.
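To make the capping step concrete, below is a minimal sketch in Python/NumPy. It assumes the auxiliary (dead-end) value function Q_D(s, a) takes values in [-1, 0], with values near -1 marking actions that almost surely lead to a dead-end, and it caps each action's probability at 1 + Q_D(s, a), redistributing the excess mass among the remaining actions. The function name cap_policy, the particular redistribution scheme, and the fallback for states where the caps sum to less than one are our own illustration, not the paper's exact construction.

```python
import numpy as np

def cap_policy(pi, q_d):
    """Cap an exploration policy with per-action bounds 1 + Q_D(s, a).

    pi  : action probabilities of the given exploration policy at the current state.
    q_d : auxiliary dead-end action values, assumed to lie in [-1, 0];
          a value near -1 marks an action that almost surely leads to a dead-end.
    Returns a policy that never exceeds the caps, with any excess probability
    mass redistributed to actions that still have headroom.
    """
    caps = np.clip(1.0 + np.asarray(q_d, dtype=float), 0.0, 1.0)
    if caps.sum() < 1.0:
        # No policy can respect all caps at this state; fall back to a
        # distribution proportional to the caps (our own choice of fallback).
        if caps.sum() > 0.0:
            return caps / caps.sum()
        return np.full_like(caps, 1.0 / len(caps))
    pi = np.asarray(pi, dtype=float).copy()
    for _ in range(len(pi)):                    # at most a few redistribution rounds
        excess = np.maximum(pi - caps, 0.0)     # mass sitting above the caps
        if excess.sum() == 0.0:
            break
        pi = np.minimum(pi, caps)
        free = caps > pi                        # actions that can still absorb mass
        headroom = caps[free] - pi[free]
        pi[free] += excess.sum() * headroom / headroom.sum()
    return pi

# Secure random-walk at one state: cap the uniform policy.
q_d = [-1.0, -0.2, 0.0, -0.9]                   # hypothetical dead-end values
uniform = np.full(4, 0.25)
print(cap_policy(uniform, q_d))                 # action 0 is blocked, its mass reassigned
```

A secure random-walk is then obtained by applying this cap to the uniform policy at every visited state, as in the usage line above; the same cap can in principle be applied to any other exploration policy.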
