Sample-Efficient Reinforcement Learning with Maximum Entropy Mellowmax Episodic Control

Deep networks have enabled reinforcement learning to scale to more complex and challenging domains, but these methods typically require large quantities of training data. An alternative is to use sample-efficient episodic control methods: neuro-inspired algorithms that use non- or semi-parametric models to predict values by storing and retrieving previously experienced transitions. One way to further improve the sample efficiency of these approaches is to use more principled exploration strategies. In this work, we therefore propose maximum entropy mellowmax episodic control (MEMEC), which samples actions according to a Boltzmann policy with a state-dependent temperature. We demonstrate that MEMEC outperforms other uncertainty- and softmax-based exploration methods on classic reinforcement learning environments and Atari games, achieving both more rapid learning and higher final rewards.
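To make the action-selection idea concrete, the sketch below shows one common way such a policy can be constructed from the mellowmax operator: per state, the inverse temperature of a Boltzmann distribution over the estimated action values is solved for so that the policy's expected value matches the mellowmax of those values (the maximum entropy mellowmax policy of Asadi and Littman). This is a minimal NumPy/SciPy illustration, not the authors' implementation; the function names, the choice of omega, and the beta_max bracket are assumptions made for the example.

import numpy as np
from scipy.optimize import brentq
from scipy.special import logsumexp

def mellowmax(q, omega):
    """Mellowmax operator: (1/omega) * log(mean(exp(omega * q)))."""
    return (logsumexp(omega * q) - np.log(len(q))) / omega

def max_entropy_mellowmax_policy(q, omega, beta_max=1e4):
    """Boltzmann policy whose state-dependent inverse temperature beta is
    chosen so that the policy's expected action value equals mellowmax(q)."""
    delta = q - mellowmax(q, omega)
    if np.allclose(delta, 0.0):
        # All action values (numerically) equal: fall back to a uniform policy.
        return np.full(len(q), 1.0 / len(q))

    def f(beta):
        # Root condition sum_a exp(beta * delta_a) * delta_a = 0, rescaled by
        # the positive factor exp(-beta * max(delta)) for numerical stability,
        # which leaves the root unchanged.
        return np.sum(np.exp(beta * (delta - delta.max())) * delta)

    # f(0) <= 0 because mellowmax is at least the mean of q (Jensen's inequality),
    # and f grows positive for large beta, so a root lies in [0, beta_max].
    beta = brentq(f, 0.0, beta_max)
    logits = beta * delta
    return np.exp(logits - logsumexp(logits))

# Usage: sample an action for one state's estimated Q-values
# (the Q-values and omega here are purely illustrative).
q_values = np.array([1.0, 1.2, 0.5])
pi = max_entropy_mellowmax_policy(q_values, omega=5.0)
action = np.random.choice(len(q_values), p=pi)

Because the temperature is recomputed per state from the spread of the action-value estimates, exploration stays broad where values are uncertain or similar and becomes nearly greedy where one action clearly dominates, which is the behaviour the abstract refers to as a state-dependent temperature.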
