Recomposing the Reinforcement Learning Building Blocks with Hypernetworks

The Reinforcement Learning (RL) building blocks, i.e. Q-functions and policy networks, usually take elements from the Cartesian product of two domains as input. In particular, the Q-function takes both a state and an action as input, and in multi-task (Meta-RL) problems the policy can take both a state and a task context. Standard architectures tend to ignore these variables' underlying interpretations and simply concatenate their features into a single vector. In this work, we argue that this choice may lead to poor gradient estimation in actor-critic algorithms and high-variance learning steps in Meta-RL algorithms. To account for the interaction between the input variables, we suggest using a Hypernetwork architecture in which a primary network determines the weights of a conditional dynamic network. We show that this approach improves the gradient approximation and reduces the learning-step variance, which both accelerates learning and improves the final performance. We demonstrate a consistent improvement across different locomotion tasks and different algorithms, both in RL (TD3 and SAC) and in Meta-RL (MAML and PEARL).

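To make the architectural idea concrete, the following is a minimal sketch of a state-conditioned hypernetwork Q-function, assuming PyTorch; the layer sizes and module names are illustrative, not the paper's exact configuration. The primary network maps the state to the weights of a small dynamic MLP, which is then applied to the action, replacing the usual concatenation of state and action into a single input vector.

```python
import torch
import torch.nn as nn


class HyperQNetwork(nn.Module):
    """Sketch of a hypernetwork Q-function: Q(s, a) = f(a; theta(s)).

    The primary network produces, per sample, the weights and biases of a
    one-hidden-layer dynamic network over the action. Sizes are illustrative.
    """

    def __init__(self, state_dim, action_dim, hidden_dim=64):
        super().__init__()
        self.action_dim = action_dim
        self.hidden_dim = hidden_dim
        # Total number of dynamic-network parameters: first layer (action_dim -> hidden_dim)
        # plus output layer (hidden_dim -> 1), weights and biases.
        n_params = (action_dim * hidden_dim + hidden_dim) + (hidden_dim + 1)
        self.primary = nn.Sequential(
            nn.Linear(state_dim, 256), nn.ReLU(),
            nn.Linear(256, n_params),
        )

    def forward(self, state, action):
        params = self.primary(state)                     # (batch, n_params)
        h, a = self.hidden_dim, self.action_dim
        # Slice the flat parameter vector into per-sample weights and biases.
        i = 0
        w1 = params[:, i:i + a * h].view(-1, h, a); i += a * h
        b1 = params[:, i:i + h];                    i += h
        w2 = params[:, i:i + h].view(-1, 1, h);     i += h
        b2 = params[:, i:i + 1]
        # Dynamic network applied to the action with state-generated weights.
        x = torch.relu(torch.bmm(w1, action.unsqueeze(-1)).squeeze(-1) + b1)
        q = torch.bmm(w2, x.unsqueeze(-1)).squeeze(-1) + b2
        return q                                         # (batch, 1)


# Example usage with hypothetical dimensions:
# q_net = HyperQNetwork(state_dim=17, action_dim=6)
# q_values = q_net(torch.randn(32, 17), torch.randn(32, 6))
```

A Meta-RL policy network can be sketched analogously, with the task context feeding the primary network and the state feeding the dynamic network.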