Geometric Entropic Exploration

Exploration is essential for solving complex Reinforcement Learning (RL) tasks. Maximum State-Visitation Entropy (MSVE) formulates the exploration problem as a well-defined policy optimization problem whose solution aims at visiting all states as uniformly as possible. This contrasts with standard uncertainty-based approaches, where exploration is transient and eventually vanishes. However, existing approaches to MSVE are theoretically justified only for discrete state spaces, as they are oblivious to the geometry of continuous domains. We address this challenge by introducing Geometric Entropy Maximisation (GEM), a new algorithm that maximises the geometry-aware Shannon entropy of state visits in both discrete and continuous domains. Our key theoretical contribution is casting geometry-aware MSVE exploration as the tractable problem of optimising a simple and novel noise-contrastive objective function. In our experiments, we show the efficiency of GEM in solving several RL problems with sparse rewards, compared against other deep RL exploration approaches.
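To make the "noise-contrastive objective" concrete, the sketch below shows a generic noise-contrastive estimation (NCE) loss in the style of Gutmann and Hyvärinen (2010), which trains an unnormalised score network to discriminate states visited by the policy from samples drawn from a noise distribution; the resulting density-ratio estimate is the kind of quantity from which a state-entropy bonus can be built. This is a minimal illustration only: the abstract does not specify GEM's actual objective, its geometry-aware kernel, or how noise states are drawn, and the names used here (ScoreNet, nce_loss, visited_states, noise_states) are hypothetical placeholders.

```python
# Hedged sketch of a standard binary NCE objective; NOT the GEM objective,
# which is not given in this abstract.
import torch
import torch.nn as nn
import torch.nn.functional as F


class ScoreNet(nn.Module):
    """Unnormalised log-density model f_theta(s) over continuous states."""

    def __init__(self, state_dim: int, hidden: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, s: torch.Tensor) -> torch.Tensor:
        return self.net(s).squeeze(-1)


def nce_loss(score_net: ScoreNet,
             visited_states: torch.Tensor,
             noise_states: torch.Tensor) -> torch.Tensor:
    """Binary NCE: classify visited states against noise samples.

    Minimising this loss pushes f_theta towards the log density-ratio
    between the state-visitation distribution and the noise distribution.
    """
    pos_logits = score_net(visited_states)  # should be classified as "data"
    neg_logits = score_net(noise_states)    # should be classified as "noise"
    loss_pos = F.binary_cross_entropy_with_logits(
        pos_logits, torch.ones_like(pos_logits))
    loss_neg = F.binary_cross_entropy_with_logits(
        neg_logits, torch.zeros_like(neg_logits))
    return loss_pos + loss_neg


if __name__ == "__main__":
    # Hypothetical usage: on-policy states vs. uniform noise over a state box.
    net = ScoreNet(state_dim=4)
    visited = torch.randn(256, 4)            # stand-in for visited states
    noise = torch.rand(256, 4) * 2.0 - 1.0   # stand-in for uniform noise
    loss = nce_loss(net, visited, noise)
    loss.backward()
```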
