UCB and InfoGain Exploration via $\boldsymbol{Q}$-Ensembles

We show how an ensemble of $Q^*$-functions can be leveraged for more effective exploration in deep reinforcement learning. We build on well-established algorithms from the bandit setting and adapt them to the $Q$-learning setting. First, we propose an exploration strategy based on upper-confidence bounds (UCB). Next, we define an "InfoGain" exploration bonus, which depends on the disagreement of the $Q$-ensemble. Our experiments show significant gains on the Atari benchmark.
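To make the two ideas concrete, the following is a minimal sketch (not the authors' exact implementation) of UCB-style action selection from an ensemble of $K$ $Q$-heads, plus an illustrative disagreement bonus. The weighting coefficients `lam` and `beta`, and the use of the ensemble standard deviation / variance as the disagreement measure, are assumptions for illustration only.

```python
# Sketch: exploration with a Q-ensemble.
# Assumptions (not from the paper): lam/beta values, std/variance as disagreement.
import numpy as np

def ucb_action(q_values, lam=0.1):
    """Select an action via mean + lam * std over the Q-ensemble.

    q_values: array of shape (K, num_actions), one row per ensemble member,
              holding Q(s, a) for the current state s.
    """
    mean_q = q_values.mean(axis=0)   # empirical mean over ensemble heads
    std_q = q_values.std(axis=0)     # ensemble disagreement per action
    return int(np.argmax(mean_q + lam * std_q))

def disagreement_bonus(q_values, beta=0.01):
    """Illustrative 'InfoGain'-style bonus: larger when the ensemble disagrees.

    Here disagreement is the variance across heads of the greedy action's
    Q-value (an assumed proxy, not necessarily the paper's exact definition).
    """
    greedy = int(q_values.mean(axis=0).argmax())
    return beta * float(q_values[:, greedy].var())

# Toy usage: an ensemble of K=5 heads over 4 actions.
rng = np.random.default_rng(0)
q = rng.normal(size=(5, 4))
print(ucb_action(q), disagreement_bonus(q))
```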
