A Policy Gradient Method for Task-Agnostic Exploration

In a reward-free environment, what is a suitable intrinsic objective for an agent to pursue so that it can learn an optimal task-agnostic exploration policy? In this paper, we argue that the entropy of the state distribution induced by limited-horizon trajectories is a sensible target. In particular, we present a novel and practical policy-search algorithm, Maximum Entropy POLicy optimization (MEPOL), which learns a policy that maximizes a non-parametric, $k$-nearest-neighbors estimate of the state distribution entropy. In contrast to existing methods, MEPOL is completely model-free, as it requires neither estimating the state distribution of any policy nor modeling the transition dynamics. We then show empirically that MEPOL learns a maximum-entropy exploration policy in high-dimensional, continuous-control domains, and that this policy facilitates learning a variety of meaningful reward-based tasks downstream.
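To make the objective concrete, below is a minimal sketch of a Kozachenko-Leonenko-style $k$-nearest-neighbors differential entropy estimate computed over states sampled from a policy's trajectories, the kind of non-parametric estimate MEPOL maximizes. This is an illustrative example, not the authors' implementation; the function name `knn_entropy`, the choice of SciPy/NumPy, and the default $k=4$ are assumptions made here for clarity.

```python
import numpy as np
from scipy.special import digamma, gammaln
from scipy.spatial import cKDTree

def knn_entropy(states, k=4):
    """k-NN (Kozachenko-Leonenko-style) estimate of the differential entropy
    of the distribution that generated `states`.

    states: (N, d) array of states sampled from the policy's trajectories.
    Returns an estimate of H(p) = -E[log p(s)].
    """
    states = np.asarray(states, dtype=np.float64)
    n, d = states.shape
    tree = cKDTree(states)
    # Query k+1 neighbors because the closest point returned is the point itself.
    dist, _ = tree.query(states, k=k + 1)
    r_k = dist[:, -1]  # distance to the k-th nearest neighbor of each state
    # Log-volume of the unit ball in d dimensions: pi^{d/2} / Gamma(d/2 + 1).
    log_unit_ball = (d / 2.0) * np.log(np.pi) - gammaln(d / 2.0 + 1.0)
    return (digamma(n) - digamma(k) + log_unit_ball
            + d * np.mean(np.log(np.maximum(r_k, 1e-12))))
```

In a policy-search setting, such an estimate would be computed on a batch of states collected by the current policy and then used as the (surrogate) objective whose gradient drives the policy update.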
