Benign Overfitting in Multiclass Classification: All Roads Lead to Interpolation

The literature on"benign overfitting"in overparameterized models has been mostly restricted to regression or binary classification; however, modern machine learning operates in the multiclass setting. Motivated by this discrepancy, we study benign overfitting in multiclass linear classification. Specifically, we consider the following training algorithms on separable data: (i) empirical risk minimization (ERM) with cross-entropy loss, which converges to the multiclass support vector machine (SVM) solution; (ii) ERM with least-squares loss, which converges to the min-norm interpolating (MNI) solution; and, (iii) the one-vs-all SVM classifier. First, we provide a simple sufficient deterministic condition under which all three algorithms lead to classifiers that interpolate the training data and have equal accuracy. When the data is generated from Gaussian mixtures or a multinomial logistic model, this condition holds under high enough effective overparameterization. We also show that this sufficient condition is satisfied under"neural collapse", a phenomenon that is observed in training deep neural networks. Second, we derive novel bounds on the accuracy of the MNI classifier, thereby showing that all three training algorithms lead to benign overfitting under sufficient overparameterization. Ultimately, our analysis shows that good generalization is possible for SVM solutions beyond the realm in which typical margin-based bounds apply.
