The Information Geometry of Unsupervised Reinforcement Learning

How can a reinforcement learning (RL) agent prepare to solve downstream tasks if those tasks are not known a priori? One approach is unsupervised skill discovery, a class of algorithms that learn a set of policies without access to a reward function. Such algorithms bear a close resemblance to representation learning algorithms (e.g., contrastive learning) in supervised learning, in that both are pretraining algorithms that maximize some approximation to a mutual information objective. While prior work has shown that the set of skills learned by such methods can accelerate downstream RL tasks, it offers little analysis of whether these skill learning algorithms are optimal, or even what notion of optimality would be appropriate for them. In this work, we show that unsupervised skill discovery algorithms based on mutual information maximization do not learn skills that are optimal for every possible reward function. However, we show that the distribution over skills provides an optimal initialization that minimizes regret against adversarially chosen reward functions, assuming a certain type of adaptation procedure. Our analysis also provides a geometric perspective on these skill learning methods.
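To make the mutual information objective concrete, the sketch below shows the standard DIAYN-style variational lower bound I(s; z) >= E[log q(z|s) - log p(z)] that this family of skill discovery methods maximizes. This is a generic illustration under assumed conventions, not the paper's implementation; the discriminator probabilities and the skill_reward helper are hypothetical stand-ins.

```python
import numpy as np

# Variational lower bound on I(s; z) between states s and skills z
# (Barber-Agakov bound):
#   I(s; z) >= E_{z ~ p(z), s ~ pi_z}[ log q(z | s) - log p(z) ]
# q(z | s) is a learned discriminator; here we assume its predicted
# probabilities over skills are given for the current state.

def skill_reward(discriminator_probs, z, num_skills):
    """Per-step intrinsic reward log q(z|s) - log p(z), uniform skill prior."""
    log_q_z_given_s = np.log(discriminator_probs[z] + 1e-8)
    log_p_z = -np.log(num_skills)  # uniform prior over skills
    return log_q_z_given_s - log_p_z

# Example: with 4 skills, a state the discriminator attributes to the active
# skill with probability 0.9 yields reward log(0.9) - log(1/4) > 0, so the
# policy is rewarded for visiting states that identify its skill.
print(skill_reward(np.array([0.9, 0.05, 0.03, 0.02]), z=0, num_skills=4))
```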
