Global Convergence and Induced Kernels of Gradient-Based Meta-Learning with Neural Nets

Gradient-based meta-learning (GBML) with deep neural networks (DNNs) has become a popular approach to few-shot learning. However, due to the non-convexity of DNNs and the complex bi-level optimization in GBML, the theoretical properties of GBML with DNNs remain largely unknown. In this paper, we first develop a novel theoretical analysis to answer the following question: does GBML with DNNs have global convergence guarantees? We provide a positive answer by proving that GBML with over-parameterized DNNs is guaranteed to converge to global optima at a linear rate. The second question we address is: how does GBML achieve fast adaptation to new tasks using experience from past, similar tasks? We answer it by proving that GBML is equivalent to a functional gradient descent operation that explicitly propagates experience from past tasks to new ones. Finally, inspired by our theoretical analysis, we develop a new kernel-based meta-learning approach. We show that the proposed approach outperforms GBML with standard DNNs on the Omniglot dataset when the number of past tasks available for meta-training is small. The code is available at this https URL (AI-secure/Meta-Neural-Kernel).
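
To make the object of study concrete, below is a minimal sketch (in JAX) of the MAML-style bi-level update that GBML methods perform: an inner loop adapts the parameters on a task's support set, and an outer loop differentiates the post-adaptation query loss through that inner update. The two-layer model, sinusoid-regression tasks, and learning rates are illustrative assumptions, not the paper's experimental setup.

import jax
import jax.numpy as jnp

def predict(params, x):
    # Small two-layer MLP; the architecture is an illustrative assumption.
    w1, b1, w2, b2 = params
    h = jnp.tanh(x @ w1 + b1)
    return h @ w2 + b2

def loss(params, x, y):
    return jnp.mean((predict(params, x) - y) ** 2)

def inner_adapt(params, x_s, y_s, inner_lr=0.1):
    # Inner loop: one gradient step of task-specific adaptation on the support set.
    grads = jax.grad(loss)(params, x_s, y_s)
    return [p - inner_lr * g for p, g in zip(params, grads)]

def meta_loss(params, task):
    # Outer objective: query-set loss evaluated after inner-loop adaptation.
    (x_s, y_s), (x_q, y_q) = task
    adapted = inner_adapt(params, x_s, y_s)
    return loss(adapted, x_q, y_q)

def meta_step(params, tasks, outer_lr=0.01):
    # Outer loop: differentiate the averaged post-adaptation loss through the
    # inner update; this is the bi-level structure the convergence analysis concerns.
    def batch_loss(p):
        return jnp.mean(jnp.stack([meta_loss(p, t) for t in tasks]))
    grads = jax.grad(batch_loss)(params)
    return [p - outer_lr * g for p, g in zip(params, grads)]

# Toy usage: two sinusoid-regression tasks with 1-D inputs.
key = jax.random.PRNGKey(0)
k1, k2, k3 = jax.random.split(key, 3)
params = [jax.random.normal(k1, (1, 32)) * 0.5, jnp.zeros(32),
          jax.random.normal(k2, (32, 1)) * 0.5, jnp.zeros(1)]

def make_task(k, phase):
    x = jax.random.uniform(k, (10, 1), minval=-3.0, maxval=3.0)
    y = jnp.sin(x + phase)
    return (x[:5], y[:5]), (x[5:], y[5:])

tasks = [make_task(k3, 0.0), make_task(k3, 1.0)]
params = meta_step(params, tasks)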
