Model selection for support vector machines: Advantages and disadvantages of the Machine Learning Theory

A common belief is that Machine Learning Theory (MLT) is not very useful, in practice, for performing effective SVM model selection. This belief is supported by experience, because well-known hold-out methods like cross-validation, leave-one-out, and the bootstrap usually achieve better results than the ones derived from MLT. We show in this paper that, in a small-sample setting, i.e. when the dimensionality of the data is larger than the number of samples, a careful application of MLT can outperform other methods in selecting the optimal hyperparameters of an SVM.
