Binary Classification of Gaussian Mixtures: Abundance of Support Vectors, Benign Overfitting, and Regularization

Deep neural networks generalize well despite being exceedingly overparameterized and trained without explicit regularization. This curious phenomenon has inspired extensive research aimed at establishing its statistical principles: Under what conditions is it observed? How do these conditions depend on the data and on the training algorithm? When does regularization benefit generalization? While such questions remain wide open for deep neural nets, recent works have sought insights by studying simpler, often linear, models. Our paper contributes to this growing line of work by examining binary linear classification under a generative Gaussian mixture model. Motivated by recent results on the implicit bias of gradient descent, we study both max-margin SVM classifiers (corresponding to logistic loss) and min-norm interpolating classifiers (corresponding to least-squares loss). First, we leverage an idea introduced in [V. Muthukumar et al., arXiv:2005.08054, 2020] to relate the SVM solution to the min-norm interpolating solution. Second, we derive novel non-asymptotic bounds on the classification error of the latter. Combining the two, we present novel sufficient conditions on the covariance spectrum and on the signal-to-noise ratio (SNR) under which interpolating estimators achieve asymptotically optimal performance as overparameterization increases. Interestingly, our results extend to a noisy model with constant-probability label flips. In contrast to previously studied discriminative data models, our results emphasize the crucial role of the SNR and its interplay with the data covariance. Finally, via a combination of analytical arguments and numerical demonstrations, we identify conditions under which the interpolating estimator performs better than its regularized counterparts.
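
As a simple numerical illustration (not the paper's code), the sketch below generates data from an isotropic Gaussian mixture, computes the min-norm interpolator of the +/-1 labels, and fits a (nearly) hard-margin linear SVM. In the heavily overparameterized regime one expects every training point to be a support vector, so the two solutions essentially coincide in direction, mirroring the connection exploited in the paper. The sample size, dimension, SNR value, isotropic covariance, and the use of scikit-learn's SVC with a very large C as a stand-in for the hard-margin SVM are all illustrative assumptions.

    import numpy as np
    from scipy.stats import norm
    from sklearn.svm import SVC

    rng = np.random.default_rng(0)

    # Hypothetical sizes: p >> n puts us in the heavily overparameterized regime.
    n, p, snr = 40, 4000, 2.0
    mu = np.zeros(p)
    mu[0] = snr                          # class mean; its Euclidean norm plays the role of the SNR

    y = rng.choice([-1.0, 1.0], size=n)  # balanced +/-1 labels
    Z = rng.standard_normal((n, p))      # isotropic Gaussian noise, i.e. covariance Sigma = I_p
    X = y[:, None] * mu + Z              # Gaussian mixture data: x_i = y_i * mu + z_i

    # Min-norm interpolator of the +/-1 labels (pseudo-inverse / least-squares solution).
    w_mn = X.T @ np.linalg.solve(X @ X.T, y)

    # (Nearly) hard-margin SVM: a linear kernel with a very large C approximates the max-margin classifier.
    svm = SVC(kernel="linear", C=1e8).fit(X, y)
    w_svm = svm.coef_.ravel()

    cos = w_mn @ w_svm / (np.linalg.norm(w_mn) * np.linalg.norm(w_svm))
    print(f"support vectors: {svm.support_.size} / {n}")    # typically all n points in this regime
    print(f"cosine(min-norm, SVM directions) = {cos:.4f}")  # close to 1 when the two solutions coincide

    # Test error of a linear classifier w on this (noiseless-label) model:
    # Q(w'mu / sqrt(w'Sigma w)), with Sigma = I_p here and Q the Gaussian tail.
    err = norm.sf(w_mn @ mu / np.linalg.norm(w_mn))
    print(f"min-norm classifier test error ~ {err:.4f}")
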

[1] Clayton Sanford et al., Support vector machines and linear regression coincide with very high-dimensional features, NeurIPS, 2021.

[2] Mikhail Belkin et al., Risk Bounds for Over-parameterized Maximum Margin Classification on Sub-Gaussian Mixtures, NeurIPS, 2021.

[3] Vladimir Braverman et al., Benign Overfitting of Constant-Stepsize SGD for Linear Regression, COLT, 2021.

[4] Nicolas Flammarion et al., Last iterate convergence of SGD for Least-Squares in the Interpolation regime, NeurIPS, 2021.

[5] Christos Thrampoulidis et al., Provable Benefits of Overparameterization in Model Compression: From Double Descent to Pruning Neural Networks, AAAI, 2020.

[6] Philip M. Long et al., When does gradient descent with logistic loss find interpolating two-layer networks?, J. Mach. Learn. Res., 2020.

[7] Christos Thrampoulidis et al., Benign Overfitting in Binary Classification of Gaussian Mixtures, ICASSP, 2021.

[8] P. Bartlett et al., Benign overfitting in ridge regression, J. Mach. Learn. Res., 2020.

[9] R. Vershynin, High-Dimensional Probability: An Introduction with Applications in Data Science, Cambridge University Press, 2018.

[10] Daniel J. Hsu et al., On the proliferation of support vectors in high dimensions, AISTATS, 2020.

[11] Babak Hassibi et al., The Performance Analysis of Generalized Margin Maximizer (GMM) on Separable Data, ICML, 2020.

[12] Michael W. Mahoney et al., A random matrix analysis of random Fourier features: beyond the Gaussian kernel, a precise phase transition, and the corresponding double descent, NeurIPS, 2020.

[13] Mikhail Belkin et al., Classification vs regression in overparameterized regimes: Does the loss function matter?, J. Mach. Learn. Res., 2020.

[14] Murat A. Erdogdu et al., Generalization of Two-layer Neural Networks: An Asymptotic Viewpoint, ICLR, 2020.

[15] Philip M. Long et al., Finite-sample analysis of interpolating linear classifiers in the overparameterized regime, J. Mach. Learn. Res., 2020.

[16] Jesse H. Krijthe et al., A brief prehistory of double descent, Proceedings of the National Academy of Sciences, 2020.

[17] Martin J. Wainwright, High-Dimensional Statistics: A Non-Asymptotic Viewpoint, Cambridge University Press, 2019.

[18] Mohamed-Slim Alouini et al., On the Precise Error Analysis of Support Vector Machines, IEEE Open Journal of Signal Processing, 2020.

[19] Chong You et al., Rethinking Bias-Variance Trade-off for Generalization of Neural Networks, ICML, 2020.

[20] Florent Krzakala et al., The role of regularization in classification of high-dimensional noisy Gaussian mixture, ICML, 2020.

[21] Christos Thrampoulidis et al., Analytic Study of Double Descent in Binary Classification: The Impact of Loss, IEEE International Symposium on Information Theory (ISIT), 2020.

[22] Boaz Barak et al., Deep double descent: where bigger models and more data hurt, ICLR, 2019.

[23] Christos Thrampoulidis et al., A Model of Double Descent for High-dimensional Binary Linear Classification, Information and Inference: A Journal of the IMA, 2019.

[24] A. Montanari et al., The generalization error of max-margin linear classifiers: High-dimensional asymptotics in the overparametrized regime, 2019.

[25] Tengyuan Liang et al., On the Risk of Minimum-Norm Interpolants and Restricted Lower Isometry of Kernels, arXiv, 2019.

[26] Andrea Montanari et al., The Generalization Error of Random Features Regression: Precise Asymptotics and the Double Descent Curve, Communications on Pure and Applied Mathematics, 2019.

[27] Philip M. Long et al., Benign overfitting in linear regression, Proceedings of the National Academy of Sciences, 2019.

[28] Matus Telgarsky et al., The implicit bias of gradient descent on nonseparable data, COLT, 2019.

[29] Babak Hassibi et al., The Impact of Regularization on High-dimensional Logistic Regression, NeurIPS, 2019.

[30] Anant Sahai et al., Harmless interpolation of noisy data in regression, IEEE International Symposium on Information Theory (ISIT), 2019.

[31] T. Hastie et al., Surprises in High-Dimensional Ridgeless Least Squares Interpolation, Annals of Statistics, 2019.

[32] Mikhail Belkin et al., Two models of double descent for weak features, SIAM J. Math. Data Sci., 2019.

[33] Mikhail Belkin et al., Reconciling modern machine learning and the bias-variance trade-off, arXiv, 2018.

[34] Daniel J. Hsu et al., Overfitting or perfect fitting? Risk bounds for classification and regression rules that interpolate, NeurIPS, 2018.

[35] D. Kobak et al., Optimal ridge penalty for real-world high-dimensional data can be zero or negative due to the implicit ridge regularization, arXiv:1805.10939, 2018.

[36] E. Candès et al., A modern maximum-likelihood theory for high-dimensional logistic regression, Proceedings of the National Academy of Sciences, 2018.

[37] Mikhail Belkin et al., To understand deep learning we need to understand kernel learning, ICML, 2018.

[38] Nathan Srebro et al., The Implicit Bias of Gradient Descent on Separable Data, J. Mach. Learn. Res., 2017.

[39] Samy Bengio et al., Understanding deep learning requires rethinking generalization, ICLR, 2016.

[40] Lorenzo Rosasco et al., Why and when can deep-but not shallow-networks avoid the curse of dimensionality: A review, International Journal of Automation and Computing, 2016.

[41] Christos Thrampoulidis et al., Precise Error Analysis of Regularized M-Estimators in High Dimensions, IEEE Transactions on Information Theory, 2016.

[42] Christos Thrampoulidis et al., Regularized Linear Regression: A Precise Analysis of the Estimation Error, COLT, 2015.

[43] Razvan Pascanu et al., On the Number of Linear Regions of Deep Neural Networks, NIPS, 2014.

[44] Christos Thrampoulidis et al., The squared-error of generalized LASSO: A precise analysis, 51st Annual Allerton Conference on Communication, Control, and Computing (Allerton), 2013.

[45] M. Rudelson et al., Hanson-Wright inequality and sub-gaussian concentration, arXiv:1306.2872, 2013.

[46] Mihailo Stojnic et al., A framework to characterize performance of LASSO algorithms, arXiv, 2013.

[47] Geoffrey E. Hinton et al., ImageNet classification with deep convolutional neural networks, Commun. ACM, 2012.

[48] Jiashun Jin, Impossibility of successful classification when useful features are rare and weak, Proceedings of the National Academy of Sciences, 2009.

[49] Michael I. Jordan et al., Convexity, Classification, and Risk Bounds, 2006.

[50] R. Tibshirani et al., The Elements of Statistical Learning: Data Mining, Inference, and Prediction, 2004.

[51] Ji Zhu et al., Margin Maximizing Loss Functions, NIPS, 2003.

[52] J. Stranlund et al., Economic inequality and burden-sharing in the provision of local environmental quality, 2002.

[53] Robert P. W. Duin et al., Classifiers in almost empty spaces, Proceedings of the 15th International Conference on Pattern Recognition (ICPR), 2000.

[54] T. H. Cormen et al., Introduction to Algorithms, MIT Press, 1996.

[55] M. Opper et al., On the ability of the optimal perceptron to generalise, 1990.

[56] F. Vallet et al., Linear and Nonlinear Extension of the Pseudo-Inverse Solution for Learning Boolean Functions, 1989.

[57] Y. Gordon, Some inequalities for Gaussian processes and applications, 1985.

[58] S. Shalev-Shwartz, Understanding Machine Learning: From Theory to Algorithms, 2014.