Bi-level stochastic gradient for large scale support vector machine

Abstract We propose a new bi-level stochastic optimization algorithm for training large-scale support vector machines (SVMs) with automatic selection of the hyperparameter C. We show that, in the proposed bi-level formulation, the variation of the inner objective with respect to the outer variable can be expressed in a simple form. Gradient estimates are computed for both the inner and outer objectives so that stochastic updates can be performed at low complexity. An extension to nonlinear SVMs is also proposed. We further discuss the possibility of integrating the technique into an automatic k-fold cross-validation framework. Preliminary results on several datasets show that the method finds the optimal hyperplane while adjusting the penalty parameter, with significant savings in computation time compared with the classic cross-validation procedure.
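
For orientation, a generic bi-level model-selection problem for a linear SVM can be written as below. This is a sketch of the standard textbook form, not necessarily the formulation used in the paper: the training set T, the validation set V, the validation loss \ell, and the omission of a bias term are all assumptions made here for illustration, since the abstract does not specify them.

```latex
% Assumed notation (not from the abstract): T = training set, V = validation set,
% \ell = a validation loss, w = weight vector of a linear SVM without bias term.
\begin{aligned}
\min_{C > 0} \quad & \frac{1}{|V|} \sum_{(x, y) \in V} \ell\bigl(y, \langle w^{*}(C), x \rangle\bigr)
  && \text{(outer problem: choose } C\text{)} \\
\text{s.t.} \quad & w^{*}(C) \in \arg\min_{w} \; \frac{1}{2} \lVert w \rVert^{2}
  + C \sum_{(x_i, y_i) \in T} \max\bigl(0,\; 1 - y_i \langle w, x_i \rangle\bigr)
  && \text{(inner problem: train the SVM)}
\end{aligned}
```

In this reading, C is the outer variable and w is the inner variable. A stochastic scheme would estimate the gradient of each level from a small sample of T or V rather than from the full sums.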
