Count-Based Exploration with Neural Density Models

Bellemare et al. (2016) introduced the notion of a pseudo-count, derived from a density model, to generalize count-based exploration to non-tabular reinforcement learning. This pseudo-count was used to generate an exploration bonus for a DQN agent and, combined with a mixed Monte Carlo update, was sufficient to achieve state of the art on the Atari 2600 game Montezuma's Revenge. We consider two questions left open by their work: First, how important is the quality of the density model for exploration? Second, what role does the Monte Carlo update play in exploration? We answer the first question by demonstrating the use of PixelCNN, an advanced neural density model for images, to supply a pseudo-count. In particular, we examine the intrinsic difficulties in adapting Bellemare et al.'s approach when assumptions about the model are violated. The result is a more practical and general algorithm requiring no special apparatus. We combine PixelCNN pseudo-counts with different agent architectures to dramatically improve the state of the art on several hard Atari games. One surprising finding is that the mixed Monte Carlo update is a powerful facilitator of exploration in the sparsest of settings, including Montezuma's Revenge.
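For context, the pseudo-count construction underlying this approach can be stated compactly. Writing rho_n(x) for the probability a density model assigns to a state x after training on n states, and rho'_n(x) for the probability it assigns after one further update on x (the recoding probability), Bellemare et al. (2016) define the pseudo-count N̂_n(x) = rho_n(x)(1 − rho'_n(x)) / (rho'_n(x) − rho_n(x)) and use an exploration bonus roughly proportional to (N̂_n(x) + ε)^(−1/2). The sketch below illustrates these quantities together with the mixed Monte Carlo (MMC) target; it is a minimal illustration, and the function names, bonus scale, and mixing weight beta are assumed defaults, not values taken from the paper.

```python
import math

def pseudo_count(rho, rho_prime):
    """Pseudo-count N_hat(x) from the density model's probability of x before
    (rho) and after (rho_prime) a single model update on x.

    Only meaningful when the model is learning-positive, i.e. rho_prime > rho;
    otherwise the estimate is clipped to zero here.
    """
    if rho_prime <= rho:
        return 0.0
    return rho * (1.0 - rho_prime) / (rho_prime - rho)

def exploration_bonus(rho, rho_prime, scale=0.05, eps=0.01):
    """Intrinsic reward proportional to 1/sqrt(N_hat(x) + eps); scale and eps
    are illustrative constants, not the paper's tuned values."""
    return scale / math.sqrt(pseudo_count(rho, rho_prime) + eps)

def mixed_monte_carlo_target(one_step_target, mc_return, beta=0.1):
    """Mixed Monte Carlo (MMC) target: interpolate the usual one-step
    Q-learning target with the full Monte Carlo return of the episode."""
    return (1.0 - beta) * one_step_target + beta * mc_return
```

In a DQN-style agent, the bonus would be added to the environment reward before forming the learning target, and the MMC target would stand in for the standard one-step target in the loss.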

[1] Marc G. Bellemare, et al. The Reactor: A Sample-Efficient Actor-Critic Architecture, 2017, ArXiv.

[2] Marc G. Bellemare, et al. The Reactor: A fast and sample-efficient Actor-Critic agent for Reinforcement Learning, 2017, ICLR.

[3] Daan Wierstra, et al. Stochastic Backpropagation and Approximate Inference in Deep Generative Models, 2014, ICML.

[4] David Silver, et al. Deep Reinforcement Learning with Double Q-Learning, 2015, AAAI.

[5] Carl E. Rasmussen, et al. PILCO: A Model-Based and Data-Efficient Approach to Policy Search, 2011, ICML.

[6] Benjamin Van Roy, et al. Generalization and Exploration via Randomized Value Functions, 2014, ICML.

[7] Marc G. Bellemare, et al. The Arcade Learning Environment: An Evaluation Platform for General Agents, 2012, J. Artif. Intell. Res.

[8] Marc G. Bellemare, et al. Skip Context Tree Switching, 2014, ICML.

[9] Alex Graves, et al. DRAW: A Recurrent Neural Network For Image Generation, 2015, ICML.

[10] Pierre-Yves Oudeyer, et al. Exploration in Model-based Reinforcement Learning by Empirically Estimating Learning Progress, 2012, NIPS.

[11] Koray Kavukcuoglu, et al. Pixel Recurrent Neural Networks, 2016, ICML.

[12] Filip De Turck, et al. VIME: Variational Information Maximizing Exploration, 2016, NIPS.

[13] Michael L. Littman, et al. An analysis of model-based Interval Estimation for Markov Decision Processes, 2008, J. Comput. Syst. Sci.

[14] Max Welling, et al. Auto-Encoding Variational Bayes, 2013, ICLR.

[15] R. French. Catastrophic forgetting in connectionist networks, 1999, Trends in Cognitive Sciences.

[16] Peter Dayan, et al. Efficient Bayes-Adaptive Reinforcement Learning using Sample-Based Search, 2012, NIPS.

[17] Matthias Bethge, et al. A note on the evaluation of generative models, 2015, ICLR.

[18] Marc G. Bellemare, et al. Safe and Efficient Off-Policy Reinforcement Learning, 2016, NIPS.

[19] Yoshua Bengio, et al. Generative Adversarial Nets, 2014, NIPS.

[20] J. Schulman, et al. Variational Information Maximizing Exploration, 2016.

[21] Richard S. Sutton, et al. Generalization in Reinforcement Learning: Successful Examples Using Sparse Coarse Coding, 1996.

[22] Filip De Turck, et al. #Exploration: A Study of Count-Based Exploration for Deep Reinforcement Learning, 2016, NIPS.

[23] Tom Schaul, et al. Unifying Count-Based Exploration and Intrinsic Motivation, 2016, NIPS.

[24] Peter Auer, et al. Near-optimal Regret Bounds for Reinforcement Learning, 2008, J. Mach. Learn. Res.

[25] Tom Schaul, et al. Dueling Network Architectures for Deep Reinforcement Learning, 2015, ICML.

[26] Pierre-Yves Oudeyer, et al. Intrinsic Motivation Systems for Autonomous Mental Development, 2007, IEEE Transactions on Evolutionary Computation.

[27] John N. Tsitsiklis, et al. Analysis of temporal-difference learning with function approximation, 1996, NIPS.

[28] Alex Graves, et al. Conditional Image Generation with PixelCNN Decoders, 2016, NIPS.

[29] Joel Veness, et al. Context Tree Switching, 2012, Data Compression Conference.

[30] Richard S. Sutton, et al. Reinforcement Learning: An Introduction, 1998, IEEE Trans. Neural Networks.

[31] Shane Legg, et al. Human-level control through deep reinforcement learning, 2015, Nature.