Novel Frank-Wolfe Methods for SVM Learning

SVM classifiers are very effective data mining tools, but scaling them to massive datasets is still an open issue. The Frank-Wolfe (FW) method handles large-scale instances of the general problem of maximizing a concave function on the unit simplex, and can be specialized to SVM training to obtain algorithms with remarkable theoretical properties and competitive performance in practice. We present and analyze a variant of the FW method designed to improve performance on large-scale SVM problems. The algorithm is based on a new way to perform away steps, a well-known strategy for accelerating the convergence of the basic FW method. We demonstrate that the method matches the guarantees, in terms of convergence rate and number of iterations, obtained with classic away steps. In particular, the method enjoys a linear rate of convergence. On the practical side, we provide experiments on several classification datasets and evaluate the results using statistical tests. The experiments show that our method is faster than the FW method and works well even in cases where classic away steps slow down the algorithm. Furthermore, these improvements are obtained without sacrificing the predictive accuracy of the resulting SVM model.
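To make the setting concrete, the following is a minimal sketch of the basic FW iteration with classic away steps on the unit simplex. It does not implement the new away-step variant proposed in the paper, which the abstract does not specify. The objective f(alpha) = -alpha^T K alpha, the function name `frank_wolfe_away`, and the parameters `max_iter` and `tol` are assumptions chosen for illustration, with K standing in for a positive semidefinite kernel matrix.

```python
def frank_wolfe_away(K, max_iter=1000, tol=1e-8):
    """Maximize f(a) = -a^T K a over the unit simplex (illustrative objective).

    Uses Frank-Wolfe with classic away steps and exact line search.
    K is a symmetric positive semidefinite matrix given as a list of lists.
    """
    n = len(K)
    alpha = [0.0] * n
    alpha[0] = 1.0                      # start at a vertex of the simplex
    Ka = [K[i][0] for i in range(n)]    # maintains the product K @ alpha

    for _ in range(max_iter):
        grad = [-2.0 * v for v in Ka]   # gradient of f at alpha
        g_alpha = sum(g * a for g, a in zip(grad, alpha))

        # Toward vertex: best ascent direction among all vertices.
        s = max(range(n), key=lambda i: grad[i])
        # Away vertex: worst vertex among those currently carrying weight.
        active = [i for i in range(n) if alpha[i] > 0.0]
        v = min(active, key=lambda i: grad[i])

        gap_fw = grad[s] - g_alpha      # <grad, e_s - alpha>
        gap_away = g_alpha - grad[v]    # <grad, alpha - e_v>
        if gap_fw <= tol:               # FW duality gap as stopping rule
            break

        if gap_fw >= gap_away or alpha[v] >= 1.0:
            # Toward step along d = e_s - alpha, step size in [0, 1].
            idx, sign, slope, step_max = s, 1.0, gap_fw, 1.0
        else:
            # Away step along d = alpha - e_v; the bound keeps alpha[v] >= 0.
            idx, sign, slope = v, -1.0, gap_away
            step_max = alpha[v] / (1.0 - alpha[v])

        # Exact line search: f(alpha + t d) = f(alpha) + t*slope - t^2 * d^T K d.
        a_K_a = sum(a * k for a, k in zip(alpha, Ka))
        curv = K[idx][idx] - 2.0 * Ka[idx] + a_K_a
        step = step_max if curv <= 0.0 else min(step_max, slope / (2.0 * curv))

        # Both step types have the form alpha <- (1 - c) * alpha + c * e_idx.
        c = sign * step
        alpha = [(1.0 - c) * a for a in alpha]
        alpha[idx] += c
        if sign < 0.0 and step == step_max:
            alpha[idx] = 0.0            # drop step: remove the vertex exactly
        Ka = [(1.0 - c) * Ka[i] + c * K[i][idx] for i in range(n)]

    return alpha


if __name__ == "__main__":
    identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
    print(frank_wolfe_away(identity))   # weights close to uniform
```

The away step moves weight off the worst active vertex instead of toward the best vertex overall, which is what lets the iterate reach a face of the simplex without the zigzagging of the basic method. In an SVM setting the nonzero entries of `alpha` would correspond to support vectors, so the sparsity produced by drop steps matters in practice.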
