On the proliferation of support vectors in high dimensions

The support vector machine (SVM) is a well-established classification method whose name refers to the particular training examples, called support vectors, that determine the maximum-margin separating hyperplane. The SVM classifier is known to enjoy good generalization properties when the number of support vectors is small compared to the number of training examples. However, recent research has shown that in sufficiently high-dimensional linear classification problems, the SVM can generalize well despite a proliferation of support vectors, i.e., even when every training example is a support vector. In this paper, we identify new deterministic equivalences for this phenomenon of support vector proliferation and use them to (1) substantially broaden the conditions under which the phenomenon occurs in high-dimensional settings, and (2) prove a nearly matching converse result.
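
To make the phenomenon concrete, here is a minimal illustrative sketch (not code from the paper): it fits a linear SVM with a very large regularization constant C as a stand-in for the hard-margin SVM on isotropic Gaussian data with a planted linear signal, and counts how many training examples end up as support vectors as the dimension grows past the sample size. The data model and the scikit-learn usage are assumptions made for illustration only, not the paper's exact setting.

```python
# Illustrative sketch of support vector proliferation (assumed setup, not the paper's).
# Uses scikit-learn's SVC with a large C as a surrogate for the hard-margin SVM.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
n, dims = 50, [20, 100, 1000, 10000]      # fixed sample size, growing dimension

for d in dims:
    X = rng.standard_normal((n, d))       # isotropic Gaussian features
    w = rng.standard_normal(d)            # planted linear signal (assumed data model)
    y = np.sign(X @ w)                    # noiseless +/-1 labels
    svm = SVC(kernel="linear", C=1e6)     # large C approximates the hard margin
    svm.fit(X, y)
    n_sv = len(svm.support_)              # indices of support vectors
    print(f"d={d:>6}: {n_sv}/{n} training points are support vectors")
```

With n fixed and d sufficiently large relative to n, the count typically reaches n out of n, i.e., every training example is a support vector; this is the proliferation regime discussed in the abstract.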
