Large-Scale Support Vector Machines: Algorithms and Theory

Support vector machines (SVMs) are a popular method for binary classification. Traditional training algorithms for SVMs, such as chunking and SMO, scale superlinearly with the number of examples, which quickly becomes infeasible for large training sets. As dataset sizes continue to grow, it becomes necessary to develop training algorithms that scale at worst linearly with the number of examples. We survey work on SVM training methods that target this large-scale learning regime. Most of these algorithms use either (1) variants of primal stochastic gradient descent (SGD), or (2) quadratic programming in the dual. For (1), we discuss why SGD generalizes well even though it is a poor optimizer, and describe algorithms such as Pegasos and FOLOS that extend basic SGD to quickly solve the SVM problem. For (2), we survey recent methods such as dual coordinate descent and BMRM, which have proven competitive with the SGD-based solvers. We also discuss the recent work of [Shalev-Shwartz and Srebro, 2008], which concludes that training time for SVMs should actually decrease as the training set size increases, and explain why SGD-based algorithms are able to satisfy this desideratum.

1. WHY LARGE-SCALE LEARNING?

Supervised learning involves analyzing a given set of labelled observations (the training set) so as to predict the labels of unlabelled future data (the test set). Specifically, the goal is to learn some function that describes the relationship between observations and their labels. Archetypal examples of supervised learning include handwritten digit recognition and spam classification.

One parameter of interest for a supervised learning problem is the size of the training set. We call a learning problem large-scale if its training set cannot be stored in a modern computer's memory [Langford, 2008]. A deeper definition is that large-scale learning consists of problems where the main computational constraint is the amount of time available, rather than the number of examples [Bottou and Bousquet, 2007]. A large training set poses a challenge for the computational complexity of a learning algorithm: for an algorithm to be feasible on such datasets, it must scale at worst linearly with the number of examples.

Most learning problems studied thus far are medium-scale, in that they assume the training set can be stored in memory and repeatedly scanned. However, with the growing volume of data in recent years, we have started to see problems that are genuinely large-scale. An example is ad-click data for search engines. When a modern search engine produces results for a query, it also displays a number of (hopefully) relevant ads. When a user clicks on an ad, the search engine receives a commission from the ad's sponsor. To price an ad reasonably, the search company therefore needs a good estimate of whether, for a given query, the ad is likely to be clicked. One way to formulate this as a learning problem is to let each training example consist of an ad and its corresponding search query, with a label denoting whether or not the ad was clicked. We wish to learn a classifier that tells us whether a given ad is likely to be clicked if it were shown for a given query.
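To make this formulation concrete, here is a minimal sketch (the token-hashing scheme and the toy click log are hypothetical, chosen purely for illustration) that turns each (query, ad) pair into a fixed-length feature vector with a ±1 click label, yielding an ordinary binary classification dataset of the kind the SVM solvers surveyed here consume:

```python
import numpy as np

DIM = 2 ** 18  # fixed-size hashed feature space, so memory per example is bounded

def featurize(query, ad_text, dim=DIM):
    """Hash the tokens of a (query, ad) pair into a bag-of-words vector."""
    x = np.zeros(dim)
    for token in (query + " " + ad_text).lower().split():
        x[hash(token) % dim] += 1.0
    return x

# Hypothetical click log: ((query, ad text), clicked?) with labels in {+1, -1}.
click_log = [
    (("cheap flights", "discount airline tickets"), +1),
    (("cheap flights", "luxury cruise deals"),      -1),
    (("python tutorial", "learn python online"),    +1),
]
X = np.array([featurize(q, a) for (q, a), _ in click_log])
y = np.array([label for _, label in click_log], dtype=float)
# X, y now form a standard binary classification training set for an SVM.
```

Hashing into a fixed number of dimensions keeps the per-example representation bounded no matter how large the query and ad vocabularies grow; the interesting question, addressed next, is what happens when the number of such examples itself becomes enormous.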
Given the volume of queries that search engines process (Google handles around 7.5 billion queries a month [Searchenginewatch.com, 2008]), the potential size of such a training set far exceeds the memory capacity of a modern system. Conventional learning algorithms cannot handle such problems, because we can no longer store the data in memory and access it freely. This necessitates the development of new algorithms, and a careful study of the challenges posed by problems of this scale. An additional motivation for studying such algorithms is that they can also be applied to medium-scale problems, which remain of immediate practical interest.

Our focus in this document is how a support vector machine (SVM), a popular method for binary classification that rests on strong theory and enjoys good practical performance, can be scaled to work with large training sets. There have been two strands of work on this topic in the literature. The first is a theoretical analysis of the problem, aimed at understanding how learning algorithms need to change to adapt to the large-scale setting. The other is the design of SVM training algorithms that work well on these large datasets, including the recent Pegasos solver [Shalev-Shwartz et al., 2007], which leverages the theoretical results on large-scale learning to actually decrease its runtime when given more examples. We discuss both strands, and attempt to identify the limitations of current solvers. First, let us define more precisely the large-scale setting that we are considering, and describe some general approaches to solving such problems.

1.1 Batch and online algorithms

When we discuss supervised learning problems with a large training set, we implicitly assume that learning is done in the batch framework. We do not focus on the online learning scenario, which consists of a potentially infinite stream of training examples presented one at a time, although such a setting can certainly be thought of as large-scale learning. However, it is possible for an online algorithm to solve a batch problem, and in fact this may be desirable in the large-scale setting, as we discuss below.

More generally, an intermediate between batch and online algorithms is what we call an online-style algorithm. This is an algorithm that assumes a batch setting, but uses only a sublinear amount of memory, and whose computational complexity scales only sublinearly with the number of examples. This precludes batch algorithms that repeatedly process the entire training set at each iteration. A standard online algorithm can be converted into an online-style algorithm.
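As a rough illustration of such a conversion (a minimal sketch, not a faithful reproduction of any particular solver; the data-loading generator, parameter values, and the cap on updates are assumptions made for the example), the routine below applies a Pegasos-style stochastic sub-gradient update [Shalev-Shwartz et al., 2007] to examples read from a stream. Memory usage is O(d) in the feature dimension regardless of the number of examples, and the work done is governed by a fixed budget of updates rather than by the size of the training set.

```python
import numpy as np
from itertools import islice

def online_style_svm(stream, dim, lam=1e-4, max_updates=10_000):
    """Pegasos-style stochastic sub-gradient updates applied to a stream of
    (x, y) examples. Only the weight vector is stored (O(dim) memory), and the
    number of updates is capped at max_updates rather than tied to the size of
    the training set."""
    w = np.zeros(dim)
    for t, (x, y) in enumerate(islice(stream, max_updates), start=1):
        eta = 1.0 / (lam * t)                 # decaying step size 1/(lambda * t)
        if y * w.dot(x) < 1.0:                # hinge loss active: use its sub-gradient
            w = (1.0 - eta * lam) * w + eta * y * x
        else:                                 # only the regularizer contributes
            w = (1.0 - eta * lam) * w
        # Optional projection onto the ball of radius 1/sqrt(lambda), as in Pegasos.
        norm = np.linalg.norm(w)
        if norm > 0.0:
            w *= min(1.0, 1.0 / (np.sqrt(lam) * norm))
    return w

# Hypothetical usage: a generator stands in for examples read one at a time
# from disk, so the full training set never needs to fit in memory.
def toy_stream(n=1_000_000, dim=20, seed=0):
    rng = np.random.default_rng(seed)
    true_w = rng.normal(size=dim)
    for _ in range(n):
        x = rng.normal(size=dim)
        yield x, (1.0 if true_w.dot(x) >= 0 else -1.0)

w = online_style_svm(toy_stream(), dim=20)
```

Dropping the projection step recovers plain primal SGD on the regularized hinge loss; the point of the sketch is only that neither the state nor the number of updates needs to grow with the training set, which is exactly the online-style regime described above.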

References

[1] Nello Cristianini et al. An introduction to Support Vector Machines, 2000.
[2] Stéphane Canu et al. Comments on the "Core Vector Machines: Fast SVM Training on Very Large Data Sets", 2007, J. Mach. Learn. Res.
[3] Sören Sonnenburg et al. Optimized cutting plane algorithm for support vector machines, 2008, ICML '08.
[4] Vladimir Vapnik. Estimations of dependences based on statistical data, 1982.
[5] Tong Zhang et al. Solving large scale linear prediction problems using stochastic gradient descent algorithms, 2004, ICML.
[6] Yoram Singer et al. The Forgetron: A Kernel-Based Perceptron on a Budget, 2008, SIAM J. Comput.
[7] Alexander J. Smola et al. Bundle Methods for Machine Learning, 2007, NIPS.
[8] Stephen P. Boyd et al. Convex Optimization, 2004.
[9] Ingo Steinwart et al. Sparseness of Support Vector Machines---Some Asymptotically Sharp Bounds, 2003, NIPS.
[10] Léon Bottou et al. The Tradeoffs of Large Scale Learning, 2007, NIPS.
[11] Jorge Nocedal et al. On the limited memory BFGS method for large scale optimization, 1989, Math. Program.
[12] Yann LeCun et al. Large Scale Online Learning, 2003, NIPS.
[13] Léon Bottou et al. On-line learning for very large data sets, 2005.
[14] Yoram Singer et al. Pegasos: primal estimated sub-gradient solver for SVM, 2011, Math. Program.
[15] Corinna Cortes et al. Support-Vector Networks, 1995, Machine Learning.
[16] Greg Schohn et al. Less is More: Active Learning with Support Vector Machines, 2000, ICML.
[17] Carl E. Rasmussen et al. The Need for Open Source Software in Machine Learning, 2007, J. Mach. Learn. Res.
[18] Nathan Srebro et al. SVM optimization: inverse dependence on training set size, 2008, ICML '08.
[19] S. Sathiya Keerthi et al. Large scale semi-supervised linear SVMs, 2006, SIGIR.
[20] Kristin P. Bennett et al. Duality and Geometry in SVM Classifiers, 2000, ICML.
[21] S. V. N. Vishwanathan et al. A Quasi-Newton Approach to Nonsmooth Convex Optimization Problems in Machine Learning, 2008, J. Mach. Learn. Res.
[22] V. Vapnik. Estimation of Dependences Based on Empirical Data, 2006.
[23] T. Poggio et al. The Mathematics of Learning: Dealing with Data, 2005, International Conference on Neural Networks and Brain.
[24] Chih-Jen Lin et al. A dual coordinate descent method for large-scale linear SVM, 2008, ICML '08.
[25] O. Mangasarian et al. Robust linear programming discrimination of two linearly inseparable sets, 1992.
[26] Yi Lin et al. Support Vector Machines and the Bayes Rule in Classification, 2002, Data Mining and Knowledge Discovery.
[27] Ambuj Tewari et al. On the Generalization Ability of Online Strongly Convex Programming Algorithms, 2008, NIPS.
[28] John Langford et al. Sparse Online Learning via Truncated Gradient, 2008, NIPS.
[29] S. V. N. Vishwanathan et al. A quasi-Newton approach to non-smooth convex optimization, 2008, ICML '08.
[30] V. Vapnik. Pattern recognition using generalized portrait method, 1963.
[31] Thorsten Joachims et al. Training linear SVMs in linear time, 2006, KDD '06.
[32] I. Tsang et al. Authors' Reply to the "Comments on the Core Vector Machines: Fast SVM Training on Very Large Data Sets", 2007.
[33] S. Shalev-Shwartz et al. Fast Convergence Rates for Excess Regularized Risk with Application to SVM, 2008.
[34] Ivor W. Tsang et al. Core Vector Machines: Fast SVM Training on Very Large Data Sets, 2005, J. Mach. Learn. Res.
[35] Simon Günter et al. A Stochastic Quasi-Newton Method for Online Convex Optimization, 2007, AISTATS.
[36] Olivier Chapelle et al. Training a Support Vector Machine in the Primal, 2007, Neural Computation.
[37] Thorsten Joachims et al. Making large-scale support vector machine learning practical, 1999.
[38] Igor Durdanovic et al. Parallel Support Vector Machines: The Cascade SVM, 2004, NIPS.
[39] Jason Weston et al. Solving multiclass support vector machines with LaRank, 2007, ICML '07.
[40] Alexander J. Smola et al. Online learning with kernels, 2001, IEEE Transactions on Signal Processing.
[41] Yann LeCun et al. Improving the convergence of back-propagation learning with second-order methods, 1989.
[42] Michael I. Jordan et al. Advances in Neural Information Processing Systems, 1995.
[43] John C. Platt et al. Fast training of support vector machines using sequential minimal optimization, 1999, Advances in Kernel Methods.
[44] G. Wahba et al. Some results on Tchebycheffian spline functions, 1971.
[45] Thorsten Joachims et al. Text Categorization with Support Vector Machines: Learning with Many Relevant Features, 1998, ECML.
[46] Jiawei Han et al. Classifying large data sets using SVMs with hierarchical clusters, 2003, KDD '03.
[47] Thorsten Joachims et al. A support vector method for multivariate performance measures, 2005, ICML.