Information-Directed Exploration for Deep Reinforcement Learning

Efficient exploration remains a major challenge for reinforcement learning. One reason is that the variability of the returns often depends on the current state and action, and is therefore heteroscedastic. Classical exploration strategies such as upper confidence bound algorithms and Thompson sampling fail to appropriately account for heteroscedasticity, even in the bandit setting. Motivated by recent findings that address this issue in bandits, we propose to use Information-Directed Sampling (IDS) for exploration in reinforcement learning. As our main contribution, we build on recent advances in distributional reinforcement learning and propose a novel, tractable approximation of IDS for deep Q-learning. The resulting exploration strategy explicitly accounts for both parametric uncertainty and heteroscedastic observation noise. We evaluate our method on Atari games and demonstrate a significant improvement over alternative approaches.
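
To make the exploration rule described above concrete, the following is a minimal, hypothetical sketch of an information-directed action-selection step: given per-action estimates of the Q-value posterior mean, its parametric (epistemic) uncertainty, and the variance of the return distribution (as a proxy for heteroscedastic observation noise), it picks the action minimizing a regret-information ratio. The function name `ids_action`, the parameters `lam` and `eps`, and the particular regret and information-gain surrogates are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def ids_action(q_mean, q_param_std, return_var, lam=0.1, eps=1e-5):
    """Illustrative information-directed sampling (IDS) action selection.

    q_mean      : per-action posterior mean of the Q-values (e.g., ensemble mean)
    q_param_std : per-action parametric (epistemic) std of the Q-values
    return_var  : per-action variance of the return distribution, used here as a
                  proxy for heteroscedastic observation noise
    lam, eps    : confidence width and numerical floor; illustrative defaults
    """
    # Conservative per-action regret estimate from optimistic/pessimistic bounds.
    upper = q_mean + lam * q_param_std
    lower = q_mean - lam * q_param_std
    regret = np.max(upper) - lower

    # Information-gain surrogate: actions whose parametric uncertainty is large
    # relative to their observation-noise variance are deemed more informative.
    info_gain = np.log1p(q_param_std ** 2 / (return_var + eps)) + eps

    # IDS picks the action minimizing the regret-information ratio.
    return int(np.argmin(regret ** 2 / info_gain))
```

The log-ratio form of the information gain follows the heteroscedastic-bandit formulation of IDS; in a deep Q-learning agent, the parametric uncertainty could come from an ensemble of value networks and the return variance from a distributional critic, but any such pairing here is an assumption of this sketch.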
