Fully Empirical and Data-Dependent Stability-Based Bounds

The purpose of this paper is to derive a fully empirical, stability-based bound on the generalization ability of a learning procedure, thereby circumventing some limitations of the structural risk minimization framework. We show that assuming a desirable property of the learning algorithm suffices to make the data dependency of stability explicit; stability is otherwise usually bounded only in an algorithm-dependent, data-independent way. In addition, we prove that a well-known and widely used classifier, the support vector machine (SVM), satisfies this condition. The resulting bound is then exploited for model selection in SVM classification and tested on a series of real-world benchmark datasets, demonstrating the practical effectiveness of our approach.
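For background, the classical polynomial hypothesis-stability bound of Bousquet and Elisseeff (2002) illustrates the kind of result the paper makes fully empirical. For a learning algorithm A with hypothesis stability beta, a loss bounded by M, and a training sample S of size n, it states that with probability at least 1 - delta:

```latex
R(A_S) \;\le\; \hat{R}_{\mathrm{emp}}(A_S) \;+\; \sqrt{\frac{M^2 + 12\, M\, n\, \beta}{2\, n\, \delta}}
```

The term beta in this classical form is bounded analytically for each algorithm; a "fully empirical" bound instead estimates the stability term from the data at hand. The sketch below is a minimal, hypothetical illustration of stability-penalized model selection for an SVM, not the paper's algorithm: beta is estimated by a crude Monte Carlo over leave-one-out retrainings, the penalty follows the classical bound above, the classifier is scikit-learn's SVC, and the helper names `empirical_hypothesis_stability` and `stability_penalty` are illustrative assumptions.

```python
# Hypothetical sketch: select SVM hyperparameters by minimizing
# empirical risk + a stability-based penalty. Not the paper's method.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X, y = make_classification(n_samples=200, n_features=10, random_state=0)

def empirical_hypothesis_stability(C, gamma, X, y, n_rounds=20):
    """Crude Monte Carlo estimate of hypothesis stability beta: the mean
    absolute change in the 0/1 loss at a random point when one randomly
    chosen training example is deleted and the SVM is retrained."""
    n = len(y)
    full = SVC(C=C, gamma=gamma).fit(X, y)   # model trained on the full sample
    diffs = []
    for _ in range(n_rounds):
        i = int(rng.integers(n))             # example to delete
        j = int(rng.integers(n))             # example where losses are compared
        mask = np.ones(n, dtype=bool)
        mask[i] = False
        loo = SVC(C=C, gamma=gamma).fit(X[mask], y[mask])
        loss_full = float(full.predict(X[j:j + 1])[0] != y[j])
        loss_loo = float(loo.predict(X[j:j + 1])[0] != y[j])
        diffs.append(abs(loss_full - loss_loo))
    return float(np.mean(diffs))

def stability_penalty(beta, n, delta=0.05, M=1.0):
    # Classical polynomial bound: sqrt((M^2 + 12*M*n*beta) / (2*n*delta)).
    return float(np.sqrt((M**2 + 12 * M * n * beta) / (2 * n * delta)))

best = None
for C in [0.1, 1.0, 10.0]:
    for gamma in [0.01, 0.1, 1.0]:
        clf = SVC(C=C, gamma=gamma).fit(X, y)
        emp_risk = 1.0 - clf.score(X, y)     # training (empirical) error
        beta = empirical_hypothesis_stability(C, gamma, X, y)
        bound = emp_risk + stability_penalty(beta, len(y))
        if best is None or bound < best[0]:
            best = (bound, C, gamma)

print("selected (bound, C, gamma):", best)
```

Unlike cross-validation, this selection rule charges each hyperparameter configuration for its estimated sensitivity to single-point perturbations, so an overfitting configuration with low training error but high instability is penalized.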
