Hyper-parameter optimization for support vector machines using stochastic gradient descent and dual coordinate descent

We developed a gradient-based method to optimize the regularization hyper-parameter, C, of support vector machines within a bilevel optimization framework. On the upper level, we optimized C to minimize the prediction loss on validation data using stochastic gradient descent. On the lower level, we used dual coordinate descent to optimize the support vector machine parameters to minimize the loss on training data. The gradient of the upper-level loss with respect to C was computed by applying the implicit function theorem to the optimality conditions of the lower-level problem, i.e., the dual problem of the support vector machine. We compared our method with an existing gradient-based method from the literature on several datasets. Numerical results showed that our method converged faster to the optimal solution and achieved better prediction accuracy on large-scale support vector machine problems.
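To make the two levels concrete, the following is a minimal sketch, not the authors' code. It assumes a linear SVM with hinge loss, solves the lower-level dual with plain dual coordinate descent, and obtains dL_val/dC by applying the implicit function theorem to the dual KKT conditions: free support vectors (0 < alpha_i < C) satisfy (Q alpha)_i = 1, while bounded ones sit at alpha_i = C, so d(alpha_i)/dC = 1 there. A smooth squared-hinge loss is assumed on the validation set so that the gradient of the validation loss with respect to w exists, and the upper-level step uses the full validation gradient for clarity; a truly stochastic variant would subsample validation points. All function names are hypothetical.

```python
import numpy as np

def dual_coordinate_descent(X, y, C, n_epochs=50, tol=1e-6):
    """Lower level: solve the SVM dual min 1/2 a'Qa - e'a, 0 <= a_i <= C."""
    n, d = X.shape
    alpha = np.zeros(n)
    w = np.zeros(d)                             # w = sum_i alpha_i y_i x_i
    Qii = np.einsum('ij,ij->i', X, X)           # diagonal of Q
    for _ in range(n_epochs):
        max_step = 0.0
        for i in np.random.permutation(n):
            if Qii[i] <= 0.0:
                continue
            G = y[i] * (X[i] @ w) - 1.0         # gradient of dual coordinate i
            a_new = np.clip(alpha[i] - G / Qii[i], 0.0, C)
            step = a_new - alpha[i]
            if abs(step) > 1e-12:
                w += step * y[i] * X[i]         # maintain w incrementally
                alpha[i] = a_new
                max_step = max(max_step, abs(step))
        if max_step < tol:
            break
    return w, alpha

def hypergradient(X, y, Xv, yv, w, alpha, C, eps=1e-8):
    """dL_val/dC via the implicit function theorem on the dual KKT system.

    Differentiating the stationarity of the free block F (with bounded
    block B, where d(alpha_B)/dC = 1) gives Q_FF d(alpha_F)/dC = -Q_FB 1.
    """
    free = (alpha > eps) & (alpha < C - eps)
    bound = alpha >= C - eps
    Zb = y[bound, None] * X[bound]              # rows y_i x_i, i in B
    dw = Zb.sum(axis=0)                         # contribution of bounded SVs
    if free.any():
        Zf = y[free, None] * X[free]
        Qff = Zf @ Zf.T
        Qfb_1 = Zf @ dw                         # Q_FB times the all-ones vector
        da_f = np.linalg.solve(Qff + eps * np.eye(Qff.shape[0]), -Qfb_1)
        dw = dw + Zf.T @ da_f                   # add free-SV contribution
    # Squared hinge on validation data (assumed smooth surrogate).
    margins = yv * (Xv @ w)
    grad_w = -2.0 * (np.clip(1.0 - margins, 0.0, None) * yv) @ Xv / len(yv)
    return grad_w @ dw

def optimize_C(X, y, Xv, yv, C0=1.0, lr=0.1, n_steps=30):
    """Upper level: gradient descent on log C (keeps C positive)."""
    logC = np.log(C0)
    for _ in range(n_steps):
        C = np.exp(logC)
        w, alpha = dual_coordinate_descent(X, y, C)
        g = hypergradient(X, y, Xv, yv, w, alpha, C)
        logC -= lr * g * C                      # chain rule: dL/dlogC = C * dL/dC
    return np.exp(logC)
```

Optimizing log C rather than C itself is a common design choice in hyper-parameter search: it enforces positivity without projection and makes the step size insensitive to the scale of C.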
