Efficient exploration with Double Uncertain Value Networks

This paper studies directed exploration for reinforcement learning agents by tracking uncertainty about the value of each available action. We identify two sources of uncertainty that are relevant for exploration: the first originates from limited data (parametric uncertainty), while the second originates from the inherent stochasticity of the returns (return uncertainty). We show how both distributions can be learned with deep neural networks, estimating parametric uncertainty with Bayesian dropout while propagating return uncertainty through the Bellman equation as a Gaussian distribution. We then show that both can be jointly estimated in one network, which we call the Double Uncertain Value Network. The policy is derived directly from the learned distributions via Thompson sampling. Experimental results show that both types of uncertainty may vastly improve learning in domains with a strong exploration challenge.
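To make the exploration mechanism concrete, below is a minimal sketch of Thompson sampling over per-action value distributions, with parametric uncertainty approximated by repeated stochastic (dropout) forward passes. The function names (`thompson_action`, `dropout_value_samples`, `toy_forward`) and the toy network are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def thompson_action(q_mean, q_sigma, rng):
    """Thompson sampling over per-action return distributions.

    q_mean, q_sigma: arrays of shape (num_actions,) giving the mean and
    standard deviation of each action's estimated value distribution.
    One sample is drawn per action; the action that is greedy with
    respect to the samples is returned.
    """
    samples = rng.normal(q_mean, q_sigma)
    return int(np.argmax(samples))

def dropout_value_samples(forward_pass, state, num_passes, rng):
    """Approximate parametric uncertainty with Monte Carlo dropout.

    `forward_pass(state, rng)` stands in for a stochastic forward pass of a
    value network with dropout kept active at evaluation time; repeated
    passes yield samples of the per-action value estimates.
    """
    return np.stack([forward_pass(state, rng) for _ in range(num_passes)])

# Toy stand-in for a dropout value network (hypothetical): Gaussian noise
# around fixed per-action value estimates.
def toy_forward(state, rng):
    base = np.array([1.0, 0.8, 1.2])
    return base + rng.normal(0.0, 0.3, size=base.shape)

passes = dropout_value_samples(toy_forward, state=None, num_passes=30, rng=rng)
q_mean, q_sigma = passes.mean(axis=0), passes.std(axis=0)
action = thompson_action(q_mean, q_sigma, rng)
print(action)
```

In this sketch the sampled value estimates summarize only parametric uncertainty; in the paper's setting the network additionally outputs a Gaussian return distribution propagated through the Bellman equation, and Thompson sampling draws from the combined distribution.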
