Maximal Discrepancy for Support Vector Machines

The Maximal Discrepancy (MD) is a powerful statistical method that has been proposed for model selection and error estimation in classification problems. This approach is particularly attractive when dealing with small-sample problems, since it avoids the use of a separate validation set. Unfortunately, the MD method requires a bounded loss function, which most learning algorithms, including the Support Vector Machine (SVM), avoid because it gives rise to a non-convex optimization problem. In this work we derive a new approach for rigorously applying the MD technique to the error estimation of the SVM while, at the same time, preserving the original SVM framework.
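For orientation, a minimal sketch of the maximal-discrepancy penalty in its usual form: assuming a sample of $2n$ examples and a loss $\ell$ bounded in $[0,1]$, the sample is split into two halves and the penalty is the largest achievable gap between the empirical errors on the two halves (the constants and confidence term below are illustrative, not the exact bound derived in this paper):

\[
  \widehat{M}_{2n}(\mathcal{F}) \;=\;
  \max_{f \in \mathcal{F}}
  \left[
    \frac{1}{n}\sum_{i=1}^{n} \ell\bigl(f(x_i), y_i\bigr)
    \;-\;
    \frac{1}{n}\sum_{i=n+1}^{2n} \ell\bigl(f(x_i), y_i\bigr)
  \right],
\]
\[
  R(\hat{f}) \;\le\; \widehat{R}_{2n}(\hat{f}) \;+\; \widehat{M}_{2n}(\mathcal{F})
  \;+\; \mathcal{O}\!\left(\sqrt{\tfrac{\ln(1/\delta)}{n}}\right)
  \quad \text{with probability at least } 1-\delta .
\]

The boundedness of $\ell$ is what makes the concentration step (and hence the bound) work, and it is precisely this requirement that conflicts with the unbounded, convex hinge loss used by the standard SVM.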
