Variational Empowerment as Representation Learning for Goal-Based Reinforcement Learning

Learning to reach goal states and learning diverse skills through mutual information (MI) maximization have been proposed as principled frameworks for self-supervised reinforcement learning, allowing agents to acquire broadly applicable multitask policies with minimal reward engineering. Starting from the simple observation that standard goal-conditioned RL (GCRL) is encapsulated by the optimization objective of variational empowerment, we discuss how GCRL and MI-based RL can be generalized into a single family of methods, which we name variational GCRL (VGCRL), interpreting variational MI maximization, or variational empowerment, as a representation learning method that acquires functionally-aware state representations for goal reaching. This novel perspective allows us to: (1) derive simple but unexplored variants of GCRL to study how adding small representation capacity can already expand its capabilities; (2) investigate how discriminator function capacity and smoothness determine the quality of discovered skills, or latent goals, through modifying latent dimensionality and applying spectral normalization; (3) adapt techniques such as hindsight experience replay (HER) from GCRL to MI-based RL; and lastly, (4) propose a novel evaluation metric, named latent goal reaching (LGR), for comparing empowerment algorithms with different choices of latent dimensionality and discriminator parameterization. Through principled mathematical derivations and careful experimental studies, our work lays a novel foundation from which to evaluate, analyze, and develop representation learning techniques in goal-based RL.
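As a minimal sketch of the connection described above (the notation below is ours, used only for illustration, and follows the standard variational lower bound on mutual information): variational empowerment maximizes a lower bound on the MI between a latent goal z ~ p(z) and the states s visited by the policy \pi, using a learned discriminator q_\phi(z \mid s),

I(S; Z) = H(Z) - H(Z \mid S) \ge H(Z) + \mathbb{E}_{z \sim p(z),\, s \sim \pi}\big[\log q_\phi(z \mid s)\big],

so the policy is effectively trained with \log q_\phi(z \mid s) as an intrinsic reward. If q_\phi is a fixed, hand-specified goal-reaching likelihood (for example, an indicator or Gaussian centered on a commanded goal), this objective reduces to standard GCRL; letting q_\phi be learned recovers MI-based skill discovery, which is the sense in which the two can be viewed as one family.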
