Robust Meta-learning for Mixed Linear Regression with Small Batches

A common challenge in practical supervised learning, such as medical image processing and robotic interactions, is that tasks are plentiful but each task has too few labeled examples to be learned in isolation. However, by exploiting the similarities across those tasks, one can hope to overcome such data scarcity. Under a canonical scenario where each task is drawn from a mixture of k linear regressions, we study a fundamental question: can abundant small-data tasks compensate for the lack of big-data tasks? Existing approaches based on second moments show that such a trade-off is efficiently achievable, with the help of medium-sized tasks with $\Omega(k^{1/2})$ examples each. However, these approaches are brittle in two important scenarios. The predictions can be arbitrarily bad (i) with even a few outliers in the dataset, or (ii) if the medium-sized tasks are slightly smaller, with $o(k^{1/2})$ examples each. We introduce a spectral approach that is simultaneously robust under both scenarios. To this end, we first design a novel outlier-robust principal component analysis algorithm that achieves optimal accuracy. This is followed by a sum-of-squares algorithm that exploits the information in higher-order moments. Together, this approach is robust against outliers and achieves a graceful statistical trade-off: the lack of $\Omega(k^{1/2})$-size tasks can be compensated for with smaller tasks, which can now be as small as $O(\log k)$.
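The following minimal Python sketch illustrates the second-moment idea the abstract refers to, under assumed Gaussian covariates: with many small tasks drawn from a mixture of k linear regressions, averaging cross terms $y_i y_j x_i x_j^\top$ over distinct example pairs within a task gives an unbiased estimate of $\mathbb{E}[\beta\beta^\top]$, whose top eigenvectors span the regression vectors. All names, dimensions, and parameters below are illustrative assumptions, not the paper's algorithm (in particular, it is not robust to outliers).

```python
# A hedged sketch of second-moment subspace estimation from many small tasks.
import numpy as np

rng = np.random.default_rng(0)

d, k = 20, 3                  # ambient dimension, number of mixture components
n_tasks, t = 5000, 5          # many tasks, each with only t examples
noise = 0.1

betas = rng.standard_normal((k, d))          # ground-truth regression vectors

M = np.zeros((d, d))
for _ in range(n_tasks):
    beta = betas[rng.integers(k)]            # task drawn from the mixture
    X = rng.standard_normal((t, d))          # Gaussian covariates
    y = X @ beta + noise * rng.standard_normal(t)
    # Average y_i * y_j * x_i x_j^T over distinct pairs (i, j) within the task;
    # cross terms are unbiased for beta beta^T because examples are independent.
    S = y[:, None] * X                        # row i is y_i * x_i
    total = S.sum(axis=0)
    pair_sum = np.outer(total, total) - S.T @ S   # sum over i != j
    M += pair_sum / (t * (t - 1))
M /= n_tasks

# Top-k eigenvectors of the symmetrized moment matrix estimate span{beta_1..k}.
eigvals, eigvecs = np.linalg.eigh((M + M.T) / 2)
U = eigvecs[:, -k:]

# Check: the true betas should lie (approximately) in the recovered subspace.
residual = betas - (betas @ U) @ U.T
print("relative residual:", np.linalg.norm(residual) / np.linalg.norm(betas))
```

This sketch only recovers the shared subspace from clean data; the paper's contribution is to make this step robust to adversarial outliers and to push the required per-task size below $k^{1/2}$ via higher-order moments and sum-of-squares relaxations.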
