Pegasos: Primal Estimated sub-GrAdient SOlver for SVM

We describe and analyze a simple and effective iterative algorithm for solving the optimization problem cast by Support Vector Machines (SVM). Our method alternates between stochastic gradient descent steps and projection steps. We prove that the number of iterations required to obtain a solution of accuracy ε is Õ(1/ε). In contrast, previous analyses of stochastic gradient descent methods require Ω(1/ε²) iterations. As in previously devised SVM solvers, the number of iterations also scales linearly with 1/λ, where λ is the regularization parameter of SVM. For a linear kernel, the total run-time of our method is Õ(d/(λε)), where d is a bound on the number of non-zero features in each example. Since the run-time does not depend directly on the size of the training set, the resulting algorithm is especially suited for learning from large datasets. Our approach can seamlessly be adapted to employ non-linear kernels while working solely on the primal objective function. We demonstrate the efficiency and applicability of our approach by conducting experiments on large text classification problems, comparing our solver to existing state-of-the-art SVM solvers. For example, it takes less than 5 seconds for our solver to converge when solving a text classification problem from Reuters Corpus Volume 1 (RCV1) with 800,000 training examples.
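For concreteness, the following is a minimal sketch of the kind of update the abstract describes: at each iteration, take a stochastic sub-gradient step on the regularized hinge loss of one randomly chosen example, with step size 1/(λt), then project the weight vector back onto the ball of radius 1/√λ known to contain the optimum. Function and variable names (pegasos, X, y, lam, T) are illustrative choices, not taken from the paper, and this linear-kernel sketch omits the paper's mini-batch and kernelized variants.

    import numpy as np

    def pegasos(X, y, lam, T, seed=0):
        # X: (n, d) array of examples; y: (n,) array of labels in {-1, +1}
        rng = np.random.default_rng(seed)
        n, d = X.shape
        w = np.zeros(d)
        for t in range(1, T + 1):
            i = rng.integers(n)            # stochastic step: pick one random example
            eta = 1.0 / (lam * t)          # step size 1/(lambda * t)
            if y[i] * X[i].dot(w) < 1.0:   # margin violated: hinge loss is active
                w = (1.0 - eta * lam) * w + eta * y[i] * X[i]
            else:                          # margin satisfied: only the regularizer acts
                w = (1.0 - eta * lam) * w
            # projection step: clip w back into the ball of radius 1/sqrt(lam)
            norm = np.linalg.norm(w)
            if norm > 1.0 / np.sqrt(lam):
                w *= 1.0 / (np.sqrt(lam) * norm)
        return w

Note that each iteration touches a single example, so the per-iteration cost is O(d) and the total work is independent of the training set size, which is the source of the Õ(d/(λε)) run-time claimed above.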
