Few-Shot Learning via Learning the Representation, Provably

This paper studies few-shot learning via representation learning, where one uses $T$ source tasks with $n_1$ samples per task to learn a representation in order to reduce the sample complexity of a target task for which only $n_2 (\ll n_1)$ samples are available. Specifically, we focus on the setting where there exists a good \emph{common representation} between source and target, and our goal is to understand how large a reduction in sample complexity is possible. First, we study the setting where this common representation is low-dimensional and provide a fast rate of $O\left(\frac{\mathcal{C}\left(\Phi\right)}{n_1T} + \frac{k}{n_2}\right)$; here, $\Phi$ is the representation function class, $\mathcal{C}\left(\Phi\right)$ is its complexity measure, and $k$ is the dimension of the representation. When specialized to linear representation functions, this rate becomes $O\left(\frac{dk}{n_1T} + \frac{k}{n_2}\right)$, where $d (\gg k)$ is the ambient input dimension; this is a substantial improvement over the rate $O\left(\frac{d}{n_2}\right)$ obtained without representation learning. Second, we consider the setting where the common representation may be high-dimensional but is capacity-constrained (say, in norm); here, we again demonstrate the advantage of representation learning in both high-dimensional linear regression and neural network learning. Our results demonstrate that representation learning can fully utilize all $n_1T$ samples from the source tasks.
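
To make the linear-representation setting concrete, below is a minimal numerical sketch of the two-stage idea behind the $O\left(\frac{dk}{n_1T} + \frac{k}{n_2}\right)$ rate: the $n_1T$ source samples are pooled to estimate the shared $k$-dimensional subspace, and the $n_2$ target samples then only need to fit $k$ coefficients on top of it. This is an illustrative sketch, not the paper's estimator; the per-task least-squares plus SVD subspace step, the problem sizes, and all variable names are assumptions made for the example.

import numpy as np

# Illustrative sketch (not the paper's exact estimator) of two-stage
# representation learning for linear regression:
#   (1) estimate a shared k-dimensional subspace B from T source tasks
#       with n1 samples each,
#   (2) fit only a k-dimensional head on the n2 target samples.
d, k, T, n1, n2, noise = 50, 5, 40, 100, 15, 0.1
rng = np.random.default_rng(0)

# Ground truth: shared representation B (d x k) and task-specific heads.
B_true, _ = np.linalg.qr(rng.normal(size=(d, k)))
W_src = rng.normal(size=(k, T))          # heads of the T source tasks
w_tgt = rng.normal(size=k)               # head of the target task

# Source data: y_t = X_t B w_t + noise, one regression problem per task.
Xs = rng.normal(size=(T, n1, d))
Ys = np.einsum('tnd,dk,kt->tn', Xs, B_true, W_src) + noise * rng.normal(size=(T, n1))

# Stage 1: per-task least squares, then a rank-k SVD of the stacked
# coefficient vectors to estimate the shared column space.
Theta = np.stack([np.linalg.lstsq(Xs[t], Ys[t], rcond=None)[0] for t in range(T)], axis=1)
U, _, _ = np.linalg.svd(Theta, full_matrices=False)
B_hat = U[:, :k]                         # estimated d x k representation

# Stage 2: the target task fits only k parameters on the learned features.
X_tgt = rng.normal(size=(n2, d))
y_tgt = X_tgt @ (B_true @ w_tgt) + noise * rng.normal(size=n2)
w_hat = np.linalg.lstsq(X_tgt @ B_hat, y_tgt, rcond=None)[0]

# Baseline: fit all d parameters directly from the n2 target samples.
theta_direct = np.linalg.lstsq(X_tgt, y_tgt, rcond=None)[0]

theta_true = B_true @ w_tgt
print("error with learned representation:", np.linalg.norm(B_hat @ w_hat - theta_true))
print("error fitting d parameters directly:", np.linalg.norm(theta_direct - theta_true))

With these (illustrative) sizes the direct fit is under-determined ($n_2 < d$) and its error is large, while the representation-based fit only estimates $k$ coefficients from the $n_2$ target samples, mirroring the $\frac{k}{n_2}$ versus $\frac{d}{n_2}$ comparison in the abstract.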
