On the Theory of Transfer Learning: The Importance of Task Diversity

We provide new statistical guarantees for transfer learning via representation learning--when transfer is achieved by learning a feature representation shared across different tasks. This enables learning on new tasks using far less data than is required to learn them in isolation. Formally, we consider $t+1$ tasks parameterized by functions of the form $f_j \circ h$ in a general function class $\mathcal{F} \circ \mathcal{H}$, where each $f_j$ is a task-specific function in $\mathcal{F}$ and $h$ is the shared representation in $\mathcal{H}$. Letting $C(\cdot)$ denote the complexity measure of the function class, we show that for diverse training tasks (1) the sample complexity needed to learn the shared representation across the first $t$ training tasks scales as $C(\mathcal{H}) + t C(\mathcal{F})$, despite no explicit access to a signal from the feature representation and (2) with an accurate estimate of the representation, the sample complexity needed to learn a new task scales only with $C(\mathcal{F})$. Our results depend upon a new general notion of task diversity--applicable to models with general tasks, features, and losses--as well as a novel chain rule for Gaussian complexities. Finally, we exhibit the utility of our general framework in several models of importance in the literature.

[1]  Martin J. Wainwright,et al.  High-Dimensional Statistics , 2019 .

[2]  Trevor Darrell,et al.  DeCAF: A Deep Convolutional Activation Feature for Generic Visual Recognition , 2013, ICML.

[3]  Massimiliano Pontil,et al.  Learning-to-Learn Stochastic Gradient Descent with Biased Regularization , 2019, ICML.

[4]  Subhransu Maji,et al.  Meta-Learning With Differentiable Convex Optimization , 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[5]  K. Do,et al.  Efficient and Adaptive Estimation for Semiparametric Models. , 1994 .

[6]  Michael I. Jordan,et al.  Union support recovery in high-dimensional multivariate regression , 2008, 2008 46th Annual Allerton Conference on Communication, Control, and Computing.

[7]  Roman Vershynin,et al.  Introduction to the non-asymptotic analysis of random matrices , 2010, Compressed Sensing.

[8]  S. Geer,et al.  Oracle Inequalities and Optimal Inference under Group Sparsity , 2010, 1007.1771.

[9]  Sergey Levine,et al.  Model-Agnostic Meta-Learning for Fast Adaptation of Deep Networks , 2017, ICML.

[10]  Pascal Vincent,et al.  Representation Learning: A Review and New Perspectives , 2012, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[11]  Sergey Levine,et al.  Online Meta-Learning , 2019, ICML.

[12]  Shai Ben-David,et al.  A notion of task relatedness yielding provable multiple-task learning guarantees , 2008, Machine Learning.

[13]  Yoshua Bengio,et al.  How transferable are features in deep neural networks? , 2014, NIPS.

[14]  Massimiliano Pontil,et al.  The Benefit of Multitask Representation Learning , 2015, J. Mach. Learn. Res..

[15]  Sham M. Kakade,et al.  Few-Shot Learning via Learning the Representation, Provably , 2020, ICLR.

[16]  P. Massart,et al.  About the constants in Talagrand's concentration inequalities for empirical processes , 2000 .

[17]  Samy Bengio,et al.  Rapid Learning or Feature Reuse? Towards Understanding the Effectiveness of MAML , 2020, ICLR.

[18]  Ohad Shamir,et al.  Size-Independent Sample Complexity of Neural Networks , 2017, COLT.

[19]  Rich Caruana,et al.  Multitask Learning , 1997, Machine Learning.

[20]  Maria-Florina Balcan,et al.  Adaptive Gradient-Based Meta-Learning Methods , 2019, NeurIPS.

[21]  Michael I. Jordan,et al.  Provable Meta-Learning of Linear Representations , 2020, ICML.

[22]  Massimiliano Pontil,et al.  Excess risk bounds for multitask learning with trace norm regularization , 2012, COLT.

[23]  P. Bickel Efficient and Adaptive Estimation for Semiparametric Models , 1993 .

[24]  Amos Storkey,et al.  Meta-Learning in Neural Networks: A Survey , 2020, IEEE transactions on pattern analysis and machine intelligence.

[25]  Andreas Maurer,et al.  A chain rule for the expected suprema of Gaussian processes , 2014, Theor. Comput. Sci..

[26]  Massimiliano Pontil,et al.  Online-Within-Online Meta-Learning , 2019, NeurIPS.

[27]  Subhashini Venugopalan,et al.  Development and Validation of a Deep Learning Algorithm for Detection of Diabetic Retinopathy in Retinal Fundus Photographs. , 2016, JAMA.

[28]  M. Talagrand,et al.  Probability in Banach Spaces: Isoperimetry and Processes , 1991 .

[29]  P. Bartlett,et al.  Local Rademacher complexities , 2005, math/0508275.

[30]  Claudio Gentile,et al.  Linear Algorithms for Online Multitask Classification , 2010, COLT.

[31]  J. Plotkin,et al.  Inferring the shape of global epistasis , 2018, Proceedings of the National Academy of Sciences.

[32]  Adam Tauman Kalai,et al.  Efficient Learning of Generalized Linear and Single Index Models with Isotonic Regression , 2011, NIPS.

[33]  Qi Li,et al.  Nonparametric Econometrics: Theory and Practice , 2006 .

[34]  Maria-Florina Balcan,et al.  Provable Guarantees for Gradient-Based Meta-Learning , 2019, ICML.

[35]  Burkhard Rost,et al.  End-to-end multitask learning, from protein language to protein features without alignments , 2019, bioRxiv.

[36]  Luke S. Zettlemoyer,et al.  Cloze-driven Pretraining of Self-attention Networks , 2019, EMNLP.

[37]  Jonathan Baxter,et al.  A Model of Inductive Bias Learning , 2000, J. Artif. Intell. Res..