论文信息 - Surprise-Based Intrinsic Motivation for Deep Reinforcement Learning

Surprise-Based Intrinsic Motivation for Deep Reinforcement Learning

Exploration in complex domains is a key challenge in reinforcement learning, especially for tasks with very sparse rewards. Recent successes in deep reinforcement learning have been achieved mostly using simple heuristic exploration strategies such as $\epsilon$-greedy action selection or Gaussian control noise, but there are many tasks where these methods are insufficient to make any learning progress. Here, we consider more complex heuristics: efficient and scalable exploration strategies that maximize a notion of an agent's surprise about its experiences via intrinsic motivation. We propose to learn a model of the MDP transition probabilities concurrently with the policy, and to form intrinsic rewards that approximate the KL-divergence of the true transition probabilities from the learned model. One of our approximations results in using surprisal as intrinsic motivation, while the other gives the $k$-step learning progress. We show that our incentives enable agents to succeed in a wide range of environments with high-dimensional state spaces and very sparse rewards, including continuous control tasks and games in the Atari RAM domain, outperforming several other heuristic exploration techniques.

S. Shankar Sastry | Joshua Achiam | S. Sastry | Joshua Achiam

[1] Jürgen Schmidhuber,et al. Curious model-building control systems , 1991, [Proceedings] 1991 IEEE International Joint Conference on Neural Networks.

[2] S. Hochreiter,et al. REINFORCEMENT DRIVEN INFORMATION ACQUISITION IN NONDETERMINISTIC ENVIRONMENTS , 1995 .

[3] Ronen I. Brafman,et al. R-MAX - A General Polynomial Time Algorithm for Near-Optimal Reinforcement Learning , 2001, J. Mach. Learn. Res..

[4] Michael Kearns,et al. Near-Optimal Reinforcement Learning in Polynomial Time , 1998, Machine Learning.

[5] Peter Auer,et al. Near-optimal Regret Bounds for Reinforcement Learning , 2008, J. Mach. Learn. Res..

[6] Pierre-Yves Oudeyer,et al. How can we define intrinsic motivation , 2008 .

[7] Pierre Baldi,et al. Bayesian surprise attracts human attention , 2005, Vision Research.

[8] Yi Sun,et al. Planning to Be Surprised: Optimal Bayesian Exploration in Dynamic Environments , 2011, AGI.

[9] Pierre-Yves Oudeyer,et al. Exploration in Model-based Reinforcement Learning by Empirically Estimating Learning Progress , 2012, NIPS.

[10] A. Barto,et al. Novelty or Surprise? , 2013, Front. Psychol..

[11] Jason Pazis,et al. PAC Optimal Exploration in Continuous Space Markov Decision Processes , 2013, AAAI.

[12] Sergey Levine,et al. Incentivizing Exploration In Reinforcement Learning With Deep Predictive Models , 2015, ArXiv.

[13] Sergey Levine,et al. Trust Region Policy Optimization , 2015, ICML.