In-Sample and Out-of-Sample Model Selection and Error Estimation for Support Vector Machines

In-sample approaches to model selection and error estimation for support vector machines (SVMs) are less widespread than out-of-sample methods, in which part of the data is removed from the training set for validation and testing. The main reasons are that in-sample approaches are not straightforward to apply in practice and that out-of-sample methods give satisfactory results in many cases. In this paper, we survey some recent and not-so-recent results of the data-dependent structural risk minimization framework and propose a reformulation of the SVM learning algorithm so that the in-sample approach can be applied effectively. Experiments on both simulated and real-world datasets show that our in-sample approach compares favorably with out-of-sample methods, especially in cases where the latter provide questionable results. In particular, when the number of samples is small compared to their dimensionality, as in the classification of microarray data, our proposal can outperform conventional out-of-sample approaches such as cross-validation, leave-one-out, or the bootstrap.
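As a rough illustration of the out-of-sample idea mentioned above, the following sketch shows how k-fold cross-validation holds part of the data out of training. It is not taken from the paper: the function names, the seed, and the choice of 10 samples and 5 folds are assumptions made only for this example, and no SVM is trained here.

```python
import random


def k_fold_indices(n_samples, k, seed=0):
    """Split range(n_samples) into k disjoint validation folds (hypothetical helper)."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Fold i takes every k-th shuffled index, so fold sizes differ by at most one.
    return [indices[i::k] for i in range(k)]


def train_validation_splits(n_samples, k, seed=0):
    """Yield one (train, validation) pair of index lists per fold."""
    folds = k_fold_indices(n_samples, k, seed)
    for i, validation in enumerate(folds):
        # Every sample outside the current fold is used for training.
        train = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train, validation


# Assumed toy setting: 10 samples, 5 folds.
splits = list(train_validation_splits(n_samples=10, k=5))
```

In this toy setting each validation fold contains only 2 of the 10 samples, and each model would be trained on the remaining 8. This is the small-sample regime in which, according to the abstract, out-of-sample estimates can become questionable.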
