Global Convergence and Generalization Bound of Gradient-Based Meta-Learning with Deep Neural Nets

Gradient-based meta-learning (GBML) with deep neural nets (DNNs) has become a popular approach for few-shot learning. However, due to the non-convexity of DNNs and the bi-level optimization in GBML, the theoretical properties of GBML with DNNs remain largely unknown. In this paper, we first aim to answer the following question: Does GBML with DNNs have global convergence guarantees? We provide a positive answer by proving that GBML with over-parameterized DNNs is guaranteed to converge to global optima at a linear rate. The second question we address is: How does GBML achieve fast adaptation to new tasks using prior experience from past tasks? To answer it, we theoretically show that GBML is equivalent to a functional gradient descent operation that explicitly propagates experience from past tasks to new ones, and we then prove a generalization error bound for GBML with over-parameterized DNNs.
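
For concreteness, the bi-level structure referred to above can be illustrated with the standard MAML-style objective, a representative instance of GBML rather than necessarily the exact formulation analyzed here. Writing $\theta$ for the meta-parameters, $\alpha$ for the inner-loop step size, and $\mathcal{L}_i$ for the loss of task $i$ among $n$ past tasks, one inner gradient step per task yields the meta-objective

\[
\min_{\theta}\; \frac{1}{n}\sum_{i=1}^{n} \mathcal{L}_i\!\bigl(\theta - \alpha \nabla_{\theta}\mathcal{L}_i(\theta)\bigr).
\]

The inner step adapts the shared initialization $\theta$ to each task, and the outer minimization over the post-adaptation losses is the non-convex bi-level problem whose global convergence and generalization are studied under over-parameterization. A minimal first-order sketch of one such meta-update, on linear-regression tasks and with illustrative names (task_grad, maml_outer_step) that are not from the paper:

    import numpy as np

    def task_grad(theta, X, y):
        # Gradient of the mean squared error of the linear model X @ theta on one task.
        return X.T @ (X @ theta - y) / len(y)

    def maml_outer_step(theta, tasks, inner_lr=0.01, outer_lr=0.1):
        # One meta-update: adapt to each past task with a single inner gradient step,
        # then descend on the average post-adaptation loss (first-order approximation,
        # i.e., second-order terms through the inner step are ignored).
        meta_grad = np.zeros_like(theta)
        for X_tr, y_tr, X_val, y_val in tasks:
            adapted = theta - inner_lr * task_grad(theta, X_tr, y_tr)  # inner loop
            meta_grad += task_grad(adapted, X_val, y_val)              # outer gradient
        return theta - outer_lr * meta_grad / len(tasks)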
