A novel Frank-Wolfe algorithm. Analysis and applications to large-scale SVM training

Recently, there has been renewed interest in the machine learning community in variants of a sparse greedy approximation procedure for concave optimization known as the Frank-Wolfe (FW) method. In particular, this procedure has been successfully applied to train large-scale instances of non-linear Support Vector Machines (SVMs). Specializing FW to SVM training has yielded efficient algorithms as well as important theoretical results, including convergence analyses of training algorithms and new characterizations of model sparsity. In this paper, we present and analyze a novel variant of the FW method based on a new way of performing away steps, a classic strategy used to accelerate the convergence of the basic FW procedure. Our formulation and analysis focus on a general concave maximization problem over the simplex. However, the specialization of our algorithm to quadratic forms is strongly related to some classic methods in computational geometry, namely the Gilbert and MDM algorithms. On the theoretical side, we show that the method matches the guarantees, in terms of convergence rate and number of iterations, obtained with classic away steps. In particular, the method enjoys a linear rate of convergence, a result recently proved for MDM on quadratic forms. On the practical side, we report experiments on several classification datasets and evaluate the results using statistical tests. The experiments show that our method is faster than the FW method with classic away steps, and works well even in cases where classic away steps slow the algorithm down. Furthermore, these improvements are obtained without sacrificing the predictive accuracy of the resulting SVM model.
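
To make the setting concrete, the following Python sketch implements the classic FW method with away steps for maximizing a concave function over the unit simplex, i.e. the baseline scheme against which the paper's new away-step variant is compared (it is not the novel variant itself). The quadratic test objective, the backtracking line search, and the stopping rule are illustrative assumptions, not details taken from the paper.

    # Minimal sketch of Frank-Wolfe with classic away steps for
    # maximizing a concave f over the unit simplex (illustrative only).
    import numpy as np

    def fw_with_away_steps(f, grad, n, max_iter=1000, tol=1e-6):
        x = np.full(n, 1.0 / n)                 # start at the barycenter of the simplex
        for _ in range(max_iter):
            g = grad(x)
            s = int(np.argmax(g))               # FW ("toward") vertex: best coordinate of the gradient
            active = np.flatnonzero(x > 0)
            v = active[int(np.argmin(g[active]))]   # away vertex: worst active coordinate

            d_fw = -x.copy(); d_fw[s] += 1.0        # toward direction e_s - x
            d_aw = x.copy();  d_aw[v] -= 1.0        # away direction   x - e_v

            if g @ d_fw >= g @ d_aw:                # pick the steeper ascent direction
                d, gamma_max = d_fw, 1.0
            else:
                d = d_aw
                gamma_max = x[v] / (1.0 - x[v]) if x[v] < 1.0 else 1.0

            if g @ d <= tol:                        # duality-gap-style stopping rule (assumed)
                break

            gamma = min(gamma_max, 1.0)             # simple backtracking line search (assumed)
            while gamma > 1e-12 and f(x + gamma * d) < f(x):
                gamma *= 0.5
            x = x + gamma * d
        return x

    # Toy usage: maximize the concave quadratic f(x) = -||Ax - b||^2 over the simplex.
    if __name__ == "__main__":
        rng = np.random.default_rng(0)
        A, b = rng.standard_normal((20, 5)), rng.standard_normal(20)
        f = lambda x: -np.sum((A @ x - b) ** 2)
        grad = lambda x: -2.0 * A.T @ (A @ x - b)
        print(fw_with_away_steps(f, grad, n=5))

The away direction removes weight from the worst active coordinate, with the maximum step chosen so the iterate stays in the simplex; this is the mechanism whose acceleration effect the paper revisits.
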
