Temporal Difference Uncertainties as a Signal for Exploration

An effective approach to exploration in reinforcement learning is to rely on an agent's uncertainty over the optimal policy, which can yield near-optimal exploration strategies in tabular settings. However, in non-tabular settings that involve function approximators, obtaining accurate uncertainty estimates is itself almost as challenging a problem. In this paper, we highlight that value estimates are easily biased and temporally inconsistent. In light of this, we propose a novel method for estimating uncertainty over the value function that relies on inducing a distribution over temporal difference errors. This exploration signal controls for state-action transitions so as to isolate the uncertainty in value that is due to uncertainty over the agent's parameters. Because our measure of uncertainty conditions on state-action transitions, we cannot act on it directly. Instead, we incorporate it as an intrinsic reward and treat exploration as a separate learning problem, induced by the agent's temporal difference uncertainties. We introduce a distinct exploration policy that learns to collect data with high estimated uncertainty, which gives rise to a curriculum that changes smoothly throughout learning and vanishes in the limit of perfect value estimates. We evaluate our method on hard exploration tasks, including Deep Sea and Atari 2600 environments, and find that our proposed form of exploration facilitates both diverse and deep exploration.
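To make the mechanism described above concrete, the sketch below illustrates one way a distribution over temporal difference errors could be turned into an intrinsic reward. It is a minimal sketch, not the authors' implementation: it assumes an ensemble of value networks as the source of parameter uncertainty, and all names (q_ensemble, q_target, td_uncertainty_reward) are hypothetical.

```python
# Minimal sketch, assuming an ensemble of K Q-networks provides the
# distribution over value-function parameters. Not the paper's exact method.
import torch


def td_uncertainty_reward(q_ensemble, q_target, obs, action, reward, next_obs, discount):
    """Intrinsic reward = spread of temporal-difference errors across an ensemble.

    Conditioning on the observed (s, a, r, s') transition isolates disagreement
    that stems from uncertainty over the agent's parameters rather than from
    which transition happened to be sampled.
    """
    td_errors = []
    for q, q_tgt in zip(q_ensemble, q_target):
        with torch.no_grad():
            # Standard one-step TD target and prediction for this ensemble member.
            target = reward + discount * q_tgt(next_obs).max(dim=-1).values
            prediction = q(obs).gather(-1, action.unsqueeze(-1)).squeeze(-1)
        td_errors.append(target - prediction)
    td_errors = torch.stack(td_errors, dim=0)  # shape: [K, batch]
    # Per-transition standard deviation of TD errors across the ensemble.
    return td_errors.std(dim=0)
```

A separate exploration policy would then be trained to maximize this intrinsic reward, steering data collection toward transitions on which the value estimates still disagree; as the value estimates improve, the signal, and with it the exploration incentive, vanishes.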
