A Policy Gradient Method for Task-Agnostic Exploration

In a reward-free environment, what is a suitable intrinsic objective for an agent to pursue so that it can learn an optimal task-agnostic exploration policy? In this paper, we argue that the entropy of the state distribution induced by limited-horizon trajectories is a sensible target. In particular, we present a novel and practical policy-search algorithm, Maximum Entropy POLicy optimization (MEPOL), which learns a policy that maximizes a non-parametric, $k$-nearest-neighbors estimate of the state distribution entropy. In contrast to existing methods, MEPOL is completely model-free, as it requires neither estimating the state distribution of any policy nor modeling the transition dynamics. We then show empirically that MEPOL learns a maximum-entropy exploration policy in high-dimensional, continuous-control domains, and that this policy facilitates learning a variety of meaningful reward-based tasks downstream.
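To make the objective concrete, below is a minimal sketch of a Kozachenko-Leonenko-style $k$-nearest-neighbors differential entropy estimate computed over states sampled from a policy's trajectories, the kind of non-parametric estimate MEPOL maximizes. This is an illustrative example, not the authors' implementation; the function name `knn_entropy`, the choice of SciPy/NumPy, and the default $k=4$ are assumptions made here for clarity.

```python
import numpy as np
from scipy.special import digamma, gammaln
from scipy.spatial import cKDTree

def knn_entropy(states, k=4):
    """k-NN (Kozachenko-Leonenko-style) estimate of the differential entropy
    of the distribution that generated `states`.

    states: (N, d) array of states sampled from the policy's trajectories.
    Returns an estimate of H(p) = -E[log p(s)].
    """
    states = np.asarray(states, dtype=np.float64)
    n, d = states.shape
    tree = cKDTree(states)
    # Query k+1 neighbors because the closest point returned is the point itself.
    dist, _ = tree.query(states, k=k + 1)
    r_k = dist[:, -1]  # distance to the k-th nearest neighbor of each state
    # Log-volume of the unit ball in d dimensions: pi^{d/2} / Gamma(d/2 + 1).
    log_unit_ball = (d / 2.0) * np.log(np.pi) - gammaln(d / 2.0 + 1.0)
    return (digamma(n) - digamma(k) + log_unit_ball
            + d * np.mean(np.log(np.maximum(r_k, 1e-12))))
```

In a policy-search setting, such an estimate would be computed on a batch of states collected by the current policy and then used as the (surrogate) objective whose gradient drives the policy update.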
