Tikhonov, Ivanov and Morozov regularization for support vector machine learning

Learning according to the structural risk minimization (SRM) principle can be naturally expressed as an Ivanov regularization problem. Vapnik himself pointed out this connection when deriving actual learning algorithms from the principle, such as the well-known support vector machine, but he quickly suggested resorting to a Tikhonov regularization scheme instead. At that time this was the best choice, because the corresponding optimization problem is easier to solve and, under certain hypotheses, the solutions obtained by the two approaches coincide. On the other hand, recent advances in learning theory clearly show that the Ivanov regularization scheme allows more effective control of the learning hypothesis space and, therefore, of the generalization ability of the selected hypothesis. In this paper we prove the equivalence between the Ivanov and Tikhonov approaches and, for completeness, their connection to Morozov regularization, which has been shown to be useful when an effective estimate of the noise in the data is available. We also show that this equivalence holds under milder conditions on the loss function than in Vapnik's original proposal. These results allow us to derive several methods for performing SRM learning according to an Ivanov or Morozov regularization scheme while using Tikhonov-based solvers, which have been thoroughly studied in recent decades and for which very efficient implementations are available.
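
For reference, the three regularization schemes mentioned above are conventionally written as follows; this is the standard textbook formulation, not notation taken from the paper itself. Here $\hat{L}(f)$ denotes the empirical loss of a hypothesis $f$, $\|f\|$ its norm in the hypothesis space, and $\lambda$, $\rho$, $\delta$ the respective regularization parameters:

$$\text{Tikhonov:} \quad \min_{f} \; \hat{L}(f) + \lambda \|f\|^2$$
$$\text{Ivanov:} \quad \min_{f} \; \hat{L}(f) \quad \text{subject to} \quad \|f\|^2 \le \rho^2$$
$$\text{Morozov:} \quad \min_{f} \; \|f\|^2 \quad \text{subject to} \quad \hat{L}(f) \le \delta$$

In this standard view, Tikhonov penalizes the hypothesis norm in the objective, Ivanov bounds it explicitly (thereby fixing the hypothesis space, as SRM requires), and Morozov constrains the empirical loss to match an assumed noise level.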
