Temporal Difference Uncertainties as a Signal for Exploration

An effective approach to exploration in reinforcement learning is to rely on an agent's uncertainty over the optimal policy, which can yield near-optimal exploration strategies in tabular settings. However, in non-tabular settings with function approximators, obtaining accurate uncertainty estimates is itself a challenging problem. In this paper, we highlight that value estimates are easily biased and temporally inconsistent. In light of this, we propose a novel method for estimating uncertainty over the value function that relies on inducing a distribution over temporal difference errors. This exploration signal controls for state-action transitions so as to isolate the uncertainty in value that is due to uncertainty over the agent's parameters. Because our measure of uncertainty conditions on state-action transitions, we cannot act on it directly. Instead, we incorporate it as an intrinsic reward and treat exploration as a separate learning problem, induced by the agent's temporal difference uncertainties. We introduce a distinct exploration policy that learns to collect data with high estimated uncertainty, which gives rise to a curriculum that smoothly changes throughout learning and vanishes in the limit of perfect value estimates. We evaluate our method on hard exploration tasks, including Deep Sea and the Atari 2600 suite, and find that our proposed form of exploration facilitates both diverse and deep exploration.
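To make the mechanism concrete, below is a minimal sketch (not the paper's implementation) of how a distribution over temporal difference errors on a fixed transition can be turned into an intrinsic exploration reward. It approximates parameter uncertainty with disagreement across a small ensemble of Q-functions; the ensemble-based approximation and all names (`N_ENSEMBLE`, `featurize`, `td_uncertainty`) are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setup: a small ensemble of linear Q-functions over random state
# features. Disagreement between ensemble members stands in for
# uncertainty over the agent's value-function parameters.
N_ENSEMBLE, N_FEATURES, N_ACTIONS, GAMMA = 8, 16, 4, 0.99

ensemble = [rng.normal(size=(N_FEATURES, N_ACTIONS)) * 0.1
            for _ in range(N_ENSEMBLE)]

def featurize(state):
    # Stand-in for whatever state representation the agent uses.
    return np.tanh(state)

def q_values(weights, state):
    return featurize(state) @ weights

def td_uncertainty(s, a, r, s_next):
    """Intrinsic reward: spread of TD errors across the ensemble for a
    fixed transition (s, a, r, s_next). Conditioning on the transition
    isolates uncertainty coming from the parameters rather than from
    the environment dynamics."""
    td_errors = []
    for w in ensemble:
        target = r + GAMMA * q_values(w, s_next).max()
        td_errors.append(target - q_values(w, s)[a])
    return float(np.std(td_errors))

# Example: score one fictitious transition. In a full agent this scalar
# would serve as the reward for a separate exploration policy, and it
# shrinks as the ensemble's value estimates come to agree, so the bonus
# vanishes in the limit of perfect value estimates.
s, s_next = rng.normal(size=N_FEATURES), rng.normal(size=N_FEATURES)
print(td_uncertainty(s, a=2, r=1.0, s_next=s_next))
```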
