Theoretical Insights Into Multiclass Classification: A High-dimensional Asymptotic View

Contemporary machine learning applications often involve classification tasks with many classes. Despite their widespread use, a precise understanding of the statistical properties and behavior of classification algorithms is still missing, especially in modern regimes where the number of classes is large. In this paper, we take a step in this direction by providing the first asymptotically precise analysis of linear multiclass classification. Our theoretical analysis allows us to characterize precisely how the test error varies with the training algorithm, the data distribution, the problem dimensions, the number of classes, the inter-/intra-class correlations, and the class priors. Specifically, our analysis reveals that the classification accuracy is highly distribution-dependent, with different algorithms achieving optimal performance for different data distributions and/or training/feature sizes. Unlike in linear regression and binary classification, the test error in multiclass classification depends on intricate functions of the trained model (e.g., the correlation between some of the trained weights) whose asymptotic behavior is difficult to characterize. This challenge is already present in simple classifiers, such as those minimizing a squared loss. Our novel theoretical techniques allow us to overcome some of these challenges. The insights gained may pave the way for a precise understanding of other classification algorithms beyond those studied in this paper.
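To make the setting concrete, the following is a minimal sketch of the simple classifier mentioned above: a linear multiclass model trained by minimizing a squared loss on one-hot labels, with test error measured empirically on synthetic Gaussian-mixture data. The dimensions, class means, priors, and sample sizes are illustrative assumptions, not the paper's exact asymptotic setup.

    # Minimal sketch (illustrative, not the paper's exact model):
    # least-squares linear multiclass classification on a Gaussian mixture.
    import numpy as np

    rng = np.random.default_rng(0)
    k, d, n_train, n_test = 4, 200, 1000, 5000  # classes, features, sample sizes

    # Random class means; isotropic Gaussian noise within each class.
    means = rng.normal(size=(k, d)) / np.sqrt(d)

    def sample(n):
        y = rng.integers(0, k, size=n)          # uniform class priors (assumption)
        X = means[y] + rng.normal(size=(n, d))  # x = mu_y + z, with z ~ N(0, I)
        return X, y

    X_tr, y_tr = sample(n_train)
    X_te, y_te = sample(n_test)

    # One-hot targets; W solves the least-squares problem min_W ||X W - Y||_F^2.
    Y = np.eye(k)[y_tr]
    W, *_ = np.linalg.lstsq(X_tr, Y, rcond=None)

    # Classify by the largest score; test error is the misclassification rate.
    err = np.mean(np.argmax(X_te @ W, axis=1) != y_te)
    print(f"test error: {err:.3f}")

Even in this squared-loss case, the test error is governed by the joint behavior of the rows of W (e.g., their pairwise correlations), which is what makes the multiclass analysis harder than its binary counterpart.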
