Meta-Learning with Warped Gradient Descent

Learning an efficient update rule from data that promotes rapid learning of new tasks from the same distribution remains an open problem in meta-learning. Previous work has typically approached this problem either by training a neural network that directly produces updates or by learning better initialisations or scaling factors for a gradient-based update rule. Both of these approaches pose challenges. On one hand, directly producing an update forgoes a useful inductive bias and can easily lead to non-converging behaviour. On the other hand, approaches that try to control a gradient-based update rule typically resort to computing gradients through the learning process to obtain their meta-gradients, leading to methods that cannot scale beyond few-shot task adaptation. In this work, we propose Warped Gradient Descent (WarpGrad), a method that intersects these approaches to mitigate their limitations. WarpGrad meta-learns an efficiently parameterised preconditioning matrix that facilitates gradient descent across the task distribution. Preconditioning arises by interleaving non-linear layers, referred to as warp-layers, between the layers of a task-learner. Warp-layers are meta-learned without backpropagating through the task training process, in a manner similar to methods that learn to directly produce updates. WarpGrad is computationally efficient, easy to implement, and can scale to arbitrarily large meta-learning problems. We provide a geometrical interpretation of the approach and evaluate its effectiveness in a variety of settings, including few-shot, standard supervised, continual and reinforcement learning.
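To make the warp-layer idea concrete, the sketch below is a minimal, hypothetical PyTorch illustration, not the authors' reference implementation. It interleaves one non-linear warp-layer between the layers of a small task-learner; the task layers are adapted per task with plain gradient descent while the warp parameters are held fixed during adaptation and updated only from gradients accumulated along the task trajectories, so no backpropagation through the inner optimisation is required. The names `WarpedMLP` and `meta_step`, the toy tasks, and the first-order treatment of the meta-objective are all illustrative assumptions.

```python
# Minimal, hypothetical sketch of interleaved warp-layers (not the authors'
# reference implementation). Task layers are adapted per task with plain
# gradient descent; the warp-layer is shared across tasks, held fixed during
# adaptation, and updated from gradients accumulated along task trajectories.
import torch
import torch.nn as nn
import torch.nn.functional as F


class WarpedMLP(nn.Module):
    """Task-learner layers with a meta-learned warp-layer interleaved."""

    def __init__(self, dim_in=4, dim_hidden=32, dim_out=2):
        super().__init__()
        # Task parameters: adapted on every new task.
        self.task_layers = nn.ModuleList([
            nn.Linear(dim_in, dim_hidden),
            nn.Linear(dim_hidden, dim_out),
        ])
        # Warp parameters: meta-learned, fixed during task adaptation.
        self.warp = nn.Linear(dim_hidden, dim_hidden)

    def forward(self, x):
        h = F.relu(self.task_layers[0](x))
        h = F.relu(self.warp(h))  # warp-layer preconditions the task gradients
        return self.task_layers[1](h)


def meta_step(model, tasks, inner_steps=5, inner_lr=0.1, meta_lr=1e-3):
    """One meta-update: adapt task layers per task, accumulate warp gradients."""
    meta_opt = torch.optim.Adam(model.warp.parameters(), lr=meta_lr)
    meta_opt.zero_grad()
    init_state = {k: v.clone() for k, v in model.task_layers.state_dict().items()}
    for x, y in tasks:
        model.task_layers.load_state_dict(init_state)  # fresh start per task
        task_opt = torch.optim.SGD(model.task_layers.parameters(), lr=inner_lr)
        for _ in range(inner_steps):
            task_opt.zero_grad()  # clears task grads only; warp grads accumulate
            loss = F.cross_entropy(model(x), y)
            loss.backward()       # warp grads collected at each point visited
            task_opt.step()       # along the task trajectory (a first-order
                                  # simplification of the meta-objective)
    meta_opt.step()               # single update of the shared warp parameters


if __name__ == "__main__":
    torch.manual_seed(0)
    model = WarpedMLP()
    # Two toy "tasks": random binary classification problems.
    tasks = [(torch.randn(16, 4), torch.randint(0, 2, (16,))) for _ in range(2)]
    meta_step(model, tasks)
```

In this sketch, decoupling the two optimisers is what keeps the warp-layer fixed during task adaptation: the inner SGD optimiser only sees the task parameters, while the outer Adam optimiser only sees the warp parameters.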
