Randomized Prior Functions for Deep Reinforcement Learning

Dealing with uncertainty is essential for efficient reinforcement learning. There is a growing literature on uncertainty estimation for deep learning from fixed datasets, but many of the most popular approaches are poorly suited to sequential decision problems. Other methods, such as bootstrap sampling, have no mechanism for uncertainty that does not come from the observed data. We highlight why this can be a crucial shortcoming and propose a simple remedy through the addition of a randomized, untrainable 'prior' network to each ensemble member. We prove that this approach is efficient with linear representations, provide simple illustrations of its efficacy with nonlinear representations, and show that this approach scales to large-scale problems far better than previous attempts.
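In outline, each ensemble member's prediction is the sum of a trainable network and a fixed, randomly initialized prior network; only the trainable part is fit to (bootstrapped) data, so members keep disagreeing wherever data is absent. The following is a minimal sketch in PyTorch, assuming a small fully connected architecture; the class name `RandomizedPriorNetwork`, the hidden sizes, and the prior scale `beta` are illustrative choices, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class RandomizedPriorNetwork(nn.Module):
    """One ensemble member: a trainable network plus a fixed, randomly
    initialized 'prior' network whose output is added at scale beta.
    Only the trainable network receives gradient updates."""

    def __init__(self, in_dim: int, out_dim: int, hidden: int = 50, beta: float = 3.0):
        super().__init__()
        self.beta = beta
        self.trainable = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.ReLU(), nn.Linear(hidden, out_dim)
        )
        self.prior = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.ReLU(), nn.Linear(hidden, out_dim)
        )
        # Freeze the prior: it supplies uncertainty that does not come
        # from the observed data and is never updated during training.
        for p in self.prior.parameters():
            p.requires_grad_(False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.trainable(x) + self.beta * self.prior(x)

# Hypothetical usage: an ensemble of K members, each with its own random
# prior, trained (e.g. on bootstrapped data) via only its trainable weights.
ensemble = [RandomizedPriorNetwork(in_dim=4, out_dim=2) for _ in range(10)]
optimizers = [
    torch.optim.Adam((p for p in m.parameters() if p.requires_grad), lr=1e-3)
    for m in ensemble
]
```

In regions with no training data the frozen priors dominate and the members disagree; this disagreement serves as an epistemic uncertainty signal that can drive deep exploration, which bootstrap resampling alone cannot provide.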
