Random Features Model with General Convex Regularization: A Fine Grained Analysis with Precise Asymptotic Learning Curves

We compute precise asymptotic expressions for the learning curves of least squares random feature (RF) models with either a separable strongly convex regularization or $\ell_1$ regularization. We propose a novel multi-level application of the convex Gaussian min-max theorem (CGMT) to overcome the traditional difficulty of finding computable expressions for random feature models with correlated data. Our result takes the form of a computable 4-dimensional scalar optimization. In contrast to previous results, our approach does not require evaluating a proximal operator, which is often intractable and scales with the number of model parameters. Furthermore, we extend the universality results for the training and generalization errors of RF models to $\ell_1$ regularization. In particular, we demonstrate that, under mild conditions, random feature models with elastic net or $\ell_1$ regularization are asymptotically equivalent to a surrogate Gaussian model with the same first and second moments. We numerically demonstrate the predictive capacity of our results and show experimentally that the predicted test error is accurate even in the non-asymptotic regime.
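As an illustration of the Gaussian-equivalence claim, the sketch below compares an $\ell_1$-regularized least squares fit on ReLU random features against the same fit on a surrogate Gaussian feature map with matched first and second moments. The data model, dimensions, activation, and penalty value `lam` are illustrative assumptions, not the paper's exact setting; the moment coefficients are estimated by Monte Carlo.

```python
# A minimal numerical sketch (not the paper's exact experiment): compare an
# l1-regularized RF regression with its surrogate Gaussian-equivalent model.
# Dimensions, the teacher model, the ReLU activation, and the penalty `lam`
# are assumptions chosen for illustration only.
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
n, d, p = 600, 200, 400      # training samples, input dimension, random features
lam = 0.01                   # l1 penalty strength (hypothetical value)
noise = 0.1                  # label noise level

F = rng.normal(size=(p, d)) / np.sqrt(d)    # random feature weights
beta = rng.normal(size=d) / np.sqrt(d)      # teacher vector for synthetic labels

def relu(z):
    return np.maximum(z, 0.0)

# Match first and second moments of relu(g), g ~ N(0,1), by Monte Carlo:
# mu0 = E[relu(g)], mu1 = E[g * relu(g)], mu2^2 = Var(relu(g)) - mu1^2.
g = rng.normal(size=1_000_000)
mu0 = relu(g).mean()
mu1 = (g * relu(g)).mean()
mu2 = np.sqrt(max(relu(g).var() - mu1**2, 0.0))

def rf_features(X):
    """Nonlinear random features sigma(X F^T)."""
    return relu(X @ F.T)

def gaussian_features(X):
    """Surrogate Gaussian model: mu0 + mu1 * X F^T + mu2 * independent noise."""
    return mu0 + mu1 * (X @ F.T) + mu2 * rng.normal(size=(X.shape[0], p))

def test_error(feature_map):
    Xtr, Xte = rng.normal(size=(n, d)), rng.normal(size=(4000, d))
    ytr = Xtr @ beta + noise * rng.normal(size=n)
    yte = Xte @ beta + noise * rng.normal(size=4000)
    model = Lasso(alpha=lam, fit_intercept=False, max_iter=100_000)
    model.fit(feature_map(Xtr), ytr)
    return np.mean((model.predict(feature_map(Xte)) - yte) ** 2)

# Under the universality result, these two errors should be close for large n, d, p.
print("RF (relu) test error     :", test_error(rf_features))
print("Gaussian surrogate error :", test_error(gaussian_features))
```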
