Support Vector Machines with the Ramp Loss and the Hard Margin Loss

To derive classifiers that are robust to outlier observations, we present integer programming formulations of Vapnik's support vector machine (SVM) with the ramp loss and the hard margin loss. The ramp loss caps the error of each training observation at 2, while the hard margin loss counts the training observations that either fall in the margin or are misclassified outside of the margin. SVM with these loss functions is shown to be a consistent estimator when used with certain kernel functions. In computational studies with simulated and real-world data, SVM with the robust loss functions ignores outlier observations effectively, providing an advantage over SVM with the traditional hinge loss when using the linear kernel. Training SVM with the robust loss functions is NP-hard and requires the solution of a quadratic mixed-integer program (QMIP), whereas traditional SVM requires only the solution of a continuous quadratic program (QP). Nevertheless, we are able to find good solutions and prove optimality for instances with up to 500 observations. Solution methods are presented for the new formulations that improve computational performance over industry-standard integer programming solvers alone.
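
The difference between the hinge loss and the two robust losses can be illustrated with a minimal sketch. It assumes the common convention that the functional margin of an observation is m = y * f(x), with y in {-1, +1}, and that an observation is "in the margin or misclassified" when m < 1. The function names and the sample margins are hypothetical and chosen only for illustration; they are not taken from the paper's formulations.

```python
# Illustrative sketch only: per-observation losses as functions of the
# functional margin m = y * f(x). Names and sample values are assumptions.

def hinge_loss(m):
    """Traditional SVM loss: grows without bound as m decreases."""
    return max(0.0, 1.0 - m)

def ramp_loss(m):
    """Hinge loss capped at 2, so one outlier contributes at most 2."""
    return min(2.0, max(0.0, 1.0 - m))

def hard_margin_loss(m):
    """Counts 1 for an observation in the margin or misclassified (m < 1)."""
    return 1.0 if m < 1.0 else 0.0

# Hypothetical margins: well classified, inside the margin,
# mildly misclassified, and a far-away outlier.
margins = [2.0, 0.5, -0.5, -10.0]

for m in margins:
    print(m, hinge_loss(m), ramp_loss(m), hard_margin_loss(m))
```

For the outlier with margin -10, the hinge loss is 11, while the ramp loss is 2 and the hard margin loss is 1. This bounded contribution is what limits the influence of outliers on the fitted classifier.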
