#Exploration: A Study of Count-Based Exploration for Deep Reinforcement Learning
Haoran Tang | Rein Houthooft | Davis Foote | Adam Stooke | Xi Chen | Yan Duan | John Schulman | Filip De Turck | Pieter Abbeel