Deep Reinforcement Learning with Risk-Seeking Exploration

In most contemporary work in deep reinforcement learning (DRL), agents are trained in simulated environments. Simulated environments are not only fast and inexpensive, they are also ‘safe’. By contrast, training in a real-world environment (using robots, for example) is not only slow and costly; actions can also cause irreversible damage, either to the environment or to the agent (robot) itself. In this paper, we take advantage of the inherent safety of computer simulation by extending the Deep Q-Network (DQN) algorithm with the ability to measure and take on risk. In essence, we propose a novel DRL algorithm that encourages risk-seeking behaviour to enhance information acquisition during training. We demonstrate the merit of this exploration heuristic by (i) arguing that our risk estimator implicitly captures both the parametric uncertainty of the model and the inherent uncertainty of the environment, which are propagated back through the temporal-difference (TD) error across many time steps, and (ii) evaluating our method on three games in the Atari domain, showing that the technique works well on Montezuma’s Revenge, a game that epitomises the challenge of sparse rewards.
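The abstract does not spell out the exact form of the risk estimator, so the following is only a minimal illustrative sketch. It assumes a variance-of-return signal learned with a second-moment TD recursion and added to the action-selection rule as a risk-seeking bonus, shown in a tabular setting rather than the paper's DQN; all names and constants (Q, M, alpha, gamma, beta, n_states, n_actions) are hypothetical and not taken from the paper.

```python
import numpy as np

# Hypothetical sketch: risk-seeking exploration driven by a TD-learned
# estimate of the return variance. The actual method extends DQN with
# function approximation; a tabular version is used here purely to show
# how a risk signal can be propagated through TD updates and used as an
# exploration bonus at action-selection time.

n_states, n_actions = 16, 4
alpha, gamma, beta = 0.1, 0.99, 1.0   # learning rate, discount, risk appetite

Q = np.zeros((n_states, n_actions))   # estimate of the expected return
M = np.zeros((n_states, n_actions))   # estimate of the return's second moment

def select_action(s):
    # Risk-seeking rule: prefer actions whose return is most uncertain.
    variance = np.maximum(M[s] - Q[s] ** 2, 0.0)
    return int(np.argmax(Q[s] + beta * np.sqrt(variance)))

def td_update(s, a, r, s_next, done):
    a_next = select_action(s_next)
    # Standard TD target for the expected return.
    target_q = r + (0.0 if done else gamma * Q[s_next, a_next])
    # Bellman-style target for the second moment:
    # E[(r + gamma * G')^2] = r^2 + 2*gamma*r*E[G'] + gamma^2 * E[G'^2].
    target_m = r ** 2 + (0.0 if done else
                         2 * gamma * r * Q[s_next, a_next]
                         + gamma ** 2 * M[s_next, a_next])
    Q[s, a] += alpha * (target_q - Q[s, a])
    M[s, a] += alpha * (target_m - M[s, a])

# Example usage on one synthetic transition (s=0, a=2, r=1.0, s'=5, non-terminal).
td_update(0, 2, 1.0, 5, False)
print(select_action(0))
```

Because the variance targets are themselves bootstrapped, uncertainty estimated at later states feeds back into earlier ones over many TD updates, which is the propagation effect the abstract alludes to; how closely this matches the paper's estimator is an assumption of the sketch.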
