Few-Shot Learning via Learning the Representation, Provably

This paper studies few-shot learning via representation learning, where one uses $T$ source tasks with $n_1$ samples per task to learn a representation in order to reduce the sample complexity of a target task for which only $n_2 (\ll n_1)$ samples are available. Specifically, we focus on the setting where there exists a good \emph{common representation} between source and target, and our goal is to understand how large a reduction in sample complexity is possible. First, we study the setting where this common representation is low-dimensional and provide a fast rate of $O\left(\frac{\mathcal{C}\left(\Phi\right)}{n_1T} + \frac{k}{n_2}\right)$; here, $\Phi$ is the representation function class, $\mathcal{C}\left(\Phi\right)$ is its complexity measure, and $k$ is the dimension of the representation. When specialized to linear representation functions, this rate becomes $O\left(\frac{dk}{n_1T} + \frac{k}{n_2}\right)$, where $d (\gg k)$ is the ambient input dimension; this is a substantial improvement over the rate $O\left(\frac{d}{n_2}\right)$ obtained without representation learning. Second, we consider the setting where the common representation may be high-dimensional but is capacity-constrained (say, in norm); here, we again demonstrate the advantage of representation learning in both high-dimensional linear regression and neural network learning. Our results demonstrate that representation learning can fully utilize all $n_1T$ samples from the source tasks.
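
To make the linear-representation setting concrete, below is a minimal numerical sketch of the two-stage idea behind the $O\left(\frac{dk}{n_1T} + \frac{k}{n_2}\right)$ rate: the $n_1T$ source samples are pooled to estimate the shared $k$-dimensional subspace, and the $n_2$ target samples then only need to fit $k$ coefficients on top of it. This is an illustrative sketch, not the paper's estimator; the per-task least-squares plus SVD subspace step, the problem sizes, and all variable names are assumptions made for the example.

import numpy as np

# Illustrative sketch (not the paper's exact estimator) of two-stage
# representation learning for linear regression:
#   (1) estimate a shared k-dimensional subspace B from T source tasks
#       with n1 samples each,
#   (2) fit only a k-dimensional head on the n2 target samples.
d, k, T, n1, n2, noise = 50, 5, 40, 100, 15, 0.1
rng = np.random.default_rng(0)

# Ground truth: shared representation B (d x k) and task-specific heads.
B_true, _ = np.linalg.qr(rng.normal(size=(d, k)))
W_src = rng.normal(size=(k, T))          # heads of the T source tasks
w_tgt = rng.normal(size=k)               # head of the target task

# Source data: y_t = X_t B w_t + noise, one regression problem per task.
Xs = rng.normal(size=(T, n1, d))
Ys = np.einsum('tnd,dk,kt->tn', Xs, B_true, W_src) + noise * rng.normal(size=(T, n1))

# Stage 1: per-task least squares, then a rank-k SVD of the stacked
# coefficient vectors to estimate the shared column space.
Theta = np.stack([np.linalg.lstsq(Xs[t], Ys[t], rcond=None)[0] for t in range(T)], axis=1)
U, _, _ = np.linalg.svd(Theta, full_matrices=False)
B_hat = U[:, :k]                         # estimated d x k representation

# Stage 2: the target task fits only k parameters on the learned features.
X_tgt = rng.normal(size=(n2, d))
y_tgt = X_tgt @ (B_true @ w_tgt) + noise * rng.normal(size=n2)
w_hat = np.linalg.lstsq(X_tgt @ B_hat, y_tgt, rcond=None)[0]

# Baseline: fit all d parameters directly from the n2 target samples.
theta_direct = np.linalg.lstsq(X_tgt, y_tgt, rcond=None)[0]

theta_true = B_true @ w_tgt
print("error with learned representation:", np.linalg.norm(B_hat @ w_hat - theta_true))
print("error fitting d parameters directly:", np.linalg.norm(theta_direct - theta_true))

With these (illustrative) sizes the direct fit is under-determined ($n_2 < d$) and its error is large, while the representation-based fit only estimates $k$ coefficients from the $n_2$ target samples, mirroring the $\frac{k}{n_2}$ versus $\frac{d}{n_2}$ comparison in the abstract.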
