Meta-Learning with Warped Gradient Descent

Learning an efficient update rule from data that promotes rapid learning of new tasks from the same distribution remains an open problem in meta-learning. Previous work has typically approached this problem either by training a neural network that directly produces updates or by learning better initialisations or scaling factors for a gradient-based update rule. Both of these approaches pose challenges. On one hand, directly producing an update forgoes a useful inductive bias and can easily lead to non-converging behaviour. On the other hand, approaches that try to control a gradient-based update rule typically resort to computing gradients through the learning process to obtain their meta-gradients, leading to methods that cannot scale beyond few-shot task adaptation. In this work, we propose Warped Gradient Descent (WarpGrad), a method that intersects these approaches to mitigate their limitations. WarpGrad meta-learns an efficiently parameterised preconditioning matrix that facilitates gradient descent across the task distribution. Preconditioning arises by interleaving non-linear layers, referred to as warp-layers, between the layers of a task-learner. Warp-layers are meta-learned without backpropagating through the task training process, in a manner similar to methods that learn to directly produce updates. WarpGrad is computationally efficient, easy to implement, and can scale to arbitrarily large meta-learning problems. We provide a geometrical interpretation of the approach and evaluate its effectiveness in a variety of settings, including few-shot, standard supervised, continual and reinforcement learning.
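To make the warp-layer idea concrete, the sketch below is a minimal, hypothetical PyTorch illustration, not the authors' reference implementation. It interleaves one non-linear warp-layer between the layers of a small task-learner; the task layers are adapted per task with plain gradient descent while the warp parameters are held fixed during adaptation and updated only from gradients accumulated along the task trajectories, so no backpropagation through the inner optimisation is required. The names `WarpedMLP` and `meta_step`, the toy tasks, and the first-order treatment of the meta-objective are all illustrative assumptions.

```python
# Minimal, hypothetical sketch of interleaved warp-layers (not the authors'
# reference implementation). Task layers are adapted per task with plain
# gradient descent; the warp-layer is shared across tasks, held fixed during
# adaptation, and updated from gradients accumulated along task trajectories.
import torch
import torch.nn as nn
import torch.nn.functional as F


class WarpedMLP(nn.Module):
    """Task-learner layers with a meta-learned warp-layer interleaved."""

    def __init__(self, dim_in=4, dim_hidden=32, dim_out=2):
        super().__init__()
        # Task parameters: adapted on every new task.
        self.task_layers = nn.ModuleList([
            nn.Linear(dim_in, dim_hidden),
            nn.Linear(dim_hidden, dim_out),
        ])
        # Warp parameters: meta-learned, fixed during task adaptation.
        self.warp = nn.Linear(dim_hidden, dim_hidden)

    def forward(self, x):
        h = F.relu(self.task_layers[0](x))
        h = F.relu(self.warp(h))  # warp-layer preconditions the task gradients
        return self.task_layers[1](h)


def meta_step(model, tasks, inner_steps=5, inner_lr=0.1, meta_lr=1e-3):
    """One meta-update: adapt task layers per task, accumulate warp gradients."""
    meta_opt = torch.optim.Adam(model.warp.parameters(), lr=meta_lr)
    meta_opt.zero_grad()
    init_state = {k: v.clone() for k, v in model.task_layers.state_dict().items()}
    for x, y in tasks:
        model.task_layers.load_state_dict(init_state)  # fresh start per task
        task_opt = torch.optim.SGD(model.task_layers.parameters(), lr=inner_lr)
        for _ in range(inner_steps):
            task_opt.zero_grad()  # clears task grads only; warp grads accumulate
            loss = F.cross_entropy(model(x), y)
            loss.backward()       # warp grads collected at each point visited
            task_opt.step()       # along the task trajectory (a first-order
                                  # simplification of the meta-objective)
    meta_opt.step()               # single update of the shared warp parameters


if __name__ == "__main__":
    torch.manual_seed(0)
    model = WarpedMLP()
    # Two toy "tasks": random binary classification problems.
    tasks = [(torch.randn(16, 4), torch.randint(0, 2, (16,))) for _ in range(2)]
    meta_step(model, tasks)
```

In this sketch, decoupling the two optimisers is what keeps the warp-layer fixed during task adaptation: the inner SGD optimiser only sees the task parameters, while the outer Adam optimiser only sees the warp parameters.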
