Bridging Multi-Task Learning and Meta-Learning: Towards Efficient Training and Effective Adaptation

Multi-task learning (MTL) aims to improve the generalization of several related tasks by learning them jointly. In comparison, modern meta-learning also trains on multiple tasks jointly, but additionally targets unseen tasks with limited labels at test time, in the hope of fast adaptation to them. Despite this subtle difference in problem formulation, both paradigms rest on the same insight: shared structure across the training tasks can lead to better generalization and adaptation. In this paper, we take an important step further in understanding the close connection between these two learning paradigms, through both theoretical analysis and empirical investigation. Theoretically, we first show that MTL shares the same optimization formulation as a class of gradient-based meta-learning (GBML) algorithms. We then prove that for over-parameterized neural networks with sufficient depth, the predictive functions learned by MTL and GBML are close; in particular, the two models make similar predictions on the same unseen task. Empirically, we corroborate these findings by showing that, with proper implementation, MTL is competitive with state-of-the-art GBML algorithms on a set of few-shot image classification benchmarks. Because existing GBML algorithms often involve costly second-order bi-level optimization, our first-order MTL method is an order of magnitude faster on large-scale datasets such as miniImageNet. We believe this work helps bridge the gap between the two learning paradigms and provides a computationally efficient alternative to GBML that still supports fast task adaptation.
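
To make the contrast between the two training schemes concrete, below is a minimal PyTorch sketch (not the authors' implementation) of a single joint MTL update alongside a single MAML-style bi-level update. The toy regression tasks, the two-layer network, and all hyperparameters are hypothetical placeholders chosen only for illustration.

import torch

torch.manual_seed(0)
d, hidden, n, lr, inner_lr = 8, 16, 10, 1e-2, 1e-2

def init_params():
    # Two-layer network kept as explicit tensors so the bi-level case can
    # evaluate it at adapted (non-leaf) parameters.
    w1 = (0.1 * torch.randn(d, hidden)).requires_grad_()
    w2 = (0.1 * torch.randn(hidden, 1)).requires_grad_()
    return [w1, w2]

def forward(params, x):
    w1, w2 = params
    return torch.relu(x @ w1) @ w2

def sample_tasks(num_tasks=4):
    # Hypothetical task sampler: each task is a linear-regression problem
    # with a support split and a query split.
    tasks = []
    for _ in range(num_tasks):
        w = torch.randn(d, 1)
        xs, xq = torch.randn(n, d), torch.randn(n, d)
        tasks.append((xs, xs @ w, xq, xq @ w))
    return tasks

mse = torch.nn.functional.mse_loss
tasks = sample_tasks()

# Multi-task learning: pool the support and query data of every task into one
# joint objective; the update needs only ordinary first-order gradients.
mtl = init_params()
joint_loss = sum(mse(forward(mtl, torch.cat([xs, xq])), torch.cat([ys, yq]))
                 for xs, ys, xq, yq in tasks) / len(tasks)
grads = torch.autograd.grad(joint_loss, mtl)
with torch.no_grad():
    for p, g in zip(mtl, grads):
        p -= lr * g

# MAML-style GBML: bi-level update that differentiates through the inner
# adaptation step, which is where the second-order cost comes from.
meta = init_params()
outer_loss = 0.0
for xs, ys, xq, yq in tasks:
    inner_grads = torch.autograd.grad(mse(forward(meta, xs), ys), meta,
                                      create_graph=True)  # keep the graph
    adapted = [p - inner_lr * g for p, g in zip(meta, inner_grads)]
    outer_loss = outer_loss + mse(forward(adapted, xq), yq)
outer_grads = torch.autograd.grad(outer_loss / len(tasks), meta)
with torch.no_grad():
    for p, g in zip(meta, outer_grads):
        p -= lr * g

The second-order cost shows up in the create_graph=True call: the outer gradient must backpropagate through every inner adaptation step, whereas the MTL update never builds such a graph, which is consistent with the speed gap the abstract reports.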
