A No-Free-Lunch Theorem for MultiTask Learning

Multitask learning and related areas such as multi-source domain adaptation address modern settings where datasets from $N$ related distributions $\{P_t\}$ are to be combined toward improving performance on any single such distribution ${\cal D}$. A perplexing fact remains in the evolving theory on the subject: while we would hope for performance bounds that account for the contribution from multiple tasks, the vast majority of analyses result in bounds that improve at best in the number $n$ of samples per task, but most often do not improve with $N$ at all. As such, it might seem at first that the distributional settings or aggregation procedures considered in such analyses are somehow unfavorable; however, as we show, the picture is more nuanced, with interestingly hard regimes that might appear otherwise favorable. In particular, we consider a seemingly favorable classification scenario where all tasks $P_t$ share a common optimal classifier $h^*$, and which can be shown to admit a broad range of regimes with improved oracle rates in terms of both $N$ and $n$. Some of our main results are as follows:

$\bullet$ We show that, even though such regimes admit minimax rates accounting for both $n$ and $N$, no adaptive algorithm exists; that is, without access to distributional information, no algorithm can guarantee rates that improve with large $N$ for fixed $n$.

$\bullet$ With a bit of additional information, namely a ranking of tasks $\{P_t\}$ according to their distance to a target ${\cal D}$, a simple rank-based procedure can achieve near-optimal aggregations of the tasks' datasets, despite a search space exponential in $N$. Interestingly, the optimal aggregation might exclude certain tasks, even though they all share the same $h^*$.
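To make the rank-based idea concrete, here is a minimal illustrative sketch (not the paper's actual procedure or guarantees): given source tasks already ranked by assumed proximity to the target, fit an ERM on each prefix of the ranked, pooled datasets and keep the prefix that performs best on a small held-out target sample. The one-dimensional threshold hypothesis class, the task parameters, and the sample sizes below are all assumptions made purely for illustration.

```python
# Illustrative sketch of rank-based prefix selection for multitask aggregation.
# All tasks share the same optimal classifier (threshold 0), but differ in
# marginal shift and label noise, so distant tasks may be worth excluding.
import numpy as np

rng = np.random.default_rng(0)

def sample_task(n, noise, shift):
    """Binary task on the line: the optimal threshold is 0 for every task."""
    x = rng.normal(loc=shift, scale=1.0, size=n)
    y = (x > 0).astype(int)
    flip = rng.random(n) < noise          # label noise differs across tasks
    return x, np.where(flip, 1 - y, y)

def erm_threshold(x, y):
    """ERM over threshold classifiers h_t(x) = 1{x > t}, candidates at data points."""
    cand = np.concatenate(([-np.inf], np.sort(x)))
    errs = [np.mean((x > t).astype(int) != y) for t in cand]
    return cand[int(np.argmin(errs))]

def target_risk(t, x, y):
    return np.mean((x > t).astype(int) != y)

# Tasks listed from (assumed) closest to farthest from the target distribution.
ranked_tasks = [sample_task(50, noise, shift)
                for noise, shift in [(0.05, 0.0), (0.10, 0.5), (0.30, 1.5), (0.45, 3.0)]]
x_val, y_val = sample_task(30, noise=0.05, shift=0.0)   # small target holdout

# Only N prefixes are considered, instead of all 2^N subsets of tasks.
best_err, best_k, best_t = np.inf, None, None
for k in range(1, len(ranked_tasks) + 1):
    xs = np.concatenate([x for x, _ in ranked_tasks[:k]])
    ys = np.concatenate([y for _, y in ranked_tasks[:k]])
    t_hat = erm_threshold(xs, ys)
    err = target_risk(t_hat, x_val, y_val)
    if err < best_err:
        best_err, best_k, best_t = err, k, t_hat

print(f"chosen prefix size k={best_k}, threshold={best_t:.3f}, holdout error={best_err:.3f}")
```

Note the design choice the ranking buys: the search is restricted to the $N$ nested prefixes of the ranking rather than all $2^N$ task subsets, and the selected prefix may well stop short of using every task, mirroring the observation that the optimal aggregation can exclude tasks even when all share $h^*$.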
