Recent Advances of Large-Scale Linear Classification This paper is a survey on development of optimization methods to construct linear classifiers suitable for large-scale applications; for some data, accuracy is close to that of nonlinear classifiers.

Linear classification is a useful tool in machine learning and data mining. For some data in a rich dimensional space, the performance (i.e., testing accuracy) of linear classi- fiers has shown to be close to that of nonlinear classifiers such as kernel methods, but training and testing speed is much faster. Recently, many research works have developed efficient optimization methods to construct linear classifiers and ap- plied them to some large-scale applications. In this paper, we give a comprehensive survey on the recent development of this active research area.

[1]  Gérard Dreyfus,et al.  Single-layer learning revisited: a stepwise procedure for building and training a neural network , 1989, NATO Neurocomputing.

[2]  Edward Y. Chang,et al.  Parallelizing Support Vector Machines on Distributed Computers , 2007, NIPS.

[3]  Yoram Singer,et al.  Efficient projections onto the l1-ball for learning in high dimensions , 2008, ICML '08.

[4]  Ambuj Tewari,et al.  Stochastic methods for l1 regularized loss minimization , 2009, ICML '09.

[5]  Joshua Goodman,et al.  Sequential Conditional Generalized Iterative Scaling , 2002, ACL.

[6]  Stephen P. Boyd,et al.  An Interior-Point Method for Large-Scale l1-Regularized Logistic Regression , 2007, J. Mach. Learn. Res..

[7]  Thorsten Joachims,et al.  Cutting-plane training of structural SVMs , 2009, Machine Learning.

[8]  Leo Breiman,et al.  Bagging Predictors , 1996, Machine Learning.

[9]  Yurii Nesterov,et al.  Primal-dual subgradient methods for convex problems , 2005, Math. Program..

[10]  Petros Drineas,et al.  On the Nyström Method for Approximating a Gram Matrix for Improved Kernel-Based Learning , 2005, J. Mach. Learn. Res..

[11]  Simon Günter,et al.  A Stochastic Quasi-Newton Method for Online Convex Optimization , 2007, AISTATS.

[12]  Yoram Singer,et al.  Efficient Online and Batch Learning Using Forward Backward Splitting , 2009, J. Mach. Learn. Res..

[13]  Chih-Jen Lin,et al.  Iterative Scaling and Coordinate Descent Methods for Maximum Entropy , 2009, ACL.

[14]  Lin Xiao,et al.  Dual Averaging Methods for Regularized Stochastic Learning and Online Optimization , 2009, J. Mach. Learn. Res..

[15]  B. Mercier,et al.  A dual algorithm for the solution of nonlinear variational problems via finite element approximation , 1976 .

[16]  Chih-Jen Lin,et al.  LIBSVM: A library for support vector machines , 2011, TIST.

[17]  Gideon S. Mann,et al.  Distributed Training Strategies for the Structured Perceptron , 2010, NAACL.

[18]  Yurii Nesterov,et al.  Efficiency of Coordinate Descent Methods on Huge-Scale Optimization Problems , 2012, SIAM J. Optim..

[19]  Chih-Jen Lin,et al.  A sequential dual method for large scale multi-class linear svms , 2008, KDD.

[20]  Rong Yan,et al.  A Faster Iterative Scaling Algorithm for Conditional Exponential Model , 2003, ICML.

[21]  Katya Scheinberg,et al.  Efficient SVM Training Using Low-Rank Kernel Representations , 2002, J. Mach. Learn. Res..

[22]  Stephen P. Boyd,et al.  Distributed Optimization and Statistical Learning via the Alternating Direction Method of Multipliers , 2011, Found. Trends Mach. Learn..

[23]  Stephen J. Wright,et al.  ASSET: Approximate Stochastic Subgradient Estimation Training for Support Vector Machines , 2012, ICPRAM.

[24]  Dianne P. O'Leary,et al.  Adaptive constraint reduction for training support vector machines. , 2008 .

[25]  Nathan Ratliff,et al.  Online) Subgradient Methods for Structured Prediction , 2007 .

[26]  Benjamin Recht,et al.  Random Features for Large-Scale Kernel Machines , 2007, NIPS.

[27]  Chih-Jen Lin,et al.  Generalized Bradley-Terry Models and Multi-Class Probability Estimates , 2006, J. Mach. Learn. Res..

[28]  Jiawei Han,et al.  Classifying large data sets using SVMs with hierarchical clusters , 2003, KDD '03.

[29]  Joseph K. Bradley,et al.  Parallel Coordinate Descent for L1-Regularized Loss Minimization , 2011, ICML.

[30]  I. Daubechies,et al.  An iterative thresholding algorithm for linear inverse problems with a sparsity constraint , 2003, math/0307152.

[31]  Carsten Wiuf,et al.  Bounded coordinate-descent for biological sequence classification in high dimensional predictor space , 2010, KDD.

[32]  Ping Li,et al.  Hashing Algorithms for Large-Scale Learning , 2011, NIPS.

[33]  Alexander J. Smola,et al.  Bundle Methods for Regularized Risk Minimization , 2010, J. Mach. Learn. Res..

[34]  Corinna Cortes,et al.  Support-Vector Networks , 1995, Machine Learning.

[35]  Kenneth Ward Church,et al.  Very sparse random projections , 2006, KDD '06.

[36]  Stephen J. Wright,et al.  Sparse reconstruction by separable approximation , 2009, IEEE Trans. Signal Process..

[37]  Peter L. Bartlett,et al.  Exponentiated Gradient Algorithms for Conditional Random Fields and Max-Margin Markov Networks , 2008, J. Mach. Learn. Res..

[38]  Jieping Ye,et al.  Large-scale sparse logistic regression , 2009, KDD.

[39]  Olvi L. Mangasarian,et al.  A finite newton method for classification , 2002, Optim. Methods Softw..

[40]  Thorsten Joachims,et al.  Training linear SVMs in linear time , 2006, KDD '06.

[41]  Fernando Pereira,et al.  Shallow Parsing with Conditional Random Fields , 2003, NAACL.

[42]  Michael I. Jordan,et al.  Predictive low-rank decomposition for kernel methods , 2005, ICML.

[43]  Kilian Q. Weinberger,et al.  Feature hashing for large scale multitask learning , 2009, ICML '09.

[44]  Chih-Jen Lin,et al.  Large linear classification when data cannot fit in memory , 2010, KDD '10.

[45]  S. Sathiya Keerthi,et al.  A Modified Finite Newton Method for Fast Solution of Large Scale Linear SVMs , 2005, J. Mach. Learn. Res..

[46]  Joachim M. Buhmann,et al.  Kernel Expansion for Online Preference Tracking , 2008, ISMIR.

[47]  Dimitris Achlioptas,et al.  Database-friendly random projections: Johnson-Lindenstrauss with binary coins , 2003, J. Comput. Syst. Sci..

[48]  Chih-Jen Lin,et al.  LIBLINEAR: A Library for Large Linear Classification , 2008, J. Mach. Learn. Res..

[49]  J. Kiefer,et al.  Stochastic Estimation of the Maximum of a Regression Function , 1952 .

[50]  Dan Roth,et al.  Selective block minimization for faster convergence of limited memory large-scale linear models , 2011, KDD.

[51]  Jason Weston,et al.  Multi-Class Support Vector Machines , 1998 .

[52]  R. Tibshirani,et al.  PATHWISE COORDINATE OPTIMIZATION , 2007, 0708.1485.

[53]  John Langford,et al.  Slow Learners are Fast , 2009, NIPS.

[54]  Andrew McCallum,et al.  Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data , 2001, ICML.

[55]  Glenn Fung,et al.  A Feature Selection Newton Method for Support Vector Machine Classification , 2004, Comput. Optim. Appl..

[56]  Michael C. Ferris,et al.  Interior-Point Methods for Massive Support Vector Machines , 2002, SIAM J. Optim..

[57]  R. Tibshirani Regression Shrinkage and Selection via the Lasso , 1996 .

[58]  Koby Crammer,et al.  On the Algorithmic Implementation of Multiclass Kernel-based Vector Machines , 2002, J. Mach. Learn. Res..

[59]  Thorsten Joachims,et al.  Making large scale SVM learning practical , 1998 .

[60]  Marc Teboulle,et al.  A Fast Iterative Shrinkage-Thresholding Algorithm for Linear Inverse Problems , 2009, SIAM J. Imaging Sci..

[61]  H. Robbins A Stochastic Approximation Method , 1951 .

[62]  Yaakov Tsaig,et al.  Fast Solution of $\ell _{1}$ -Norm Minimization Problems When the Solution May Be Sparse , 2008, IEEE Transactions on Information Theory.

[63]  H. Zou,et al.  Regularization and variable selection via the elastic net , 2005 .

[64]  Yoram Singer,et al.  Pegasos: primal estimated sub-gradient solver for SVM , 2011, Math. Program..

[65]  Isabelle Guyon,et al.  Comparison of classifier methods: a case study in handwritten digit recognition , 1994, Proceedings of the 12th IAPR International Conference on Pattern Recognition, Vol. 3 - Conference C: Signal Processing (Cat. No.94CH3440-5).

[66]  Mark W. Schmidt,et al.  Accelerated training of conditional random fields with stochastic gradient methods , 2006, ICML.

[67]  Trevor Hastie,et al.  Regularization Paths for Generalized Linear Models via Coordinate Descent. , 2010, Journal of statistical software.

[68]  E. M. Gertz,et al.  Support vector machine classifiers for large data sets. , 2006 .

[69]  John D. Lafferty,et al.  Inducing Features of Random Fields , 1995, IEEE Trans. Pattern Anal. Mach. Intell..

[70]  Sören Sonnenburg,et al.  COFFIN: A Computational Framework for Linear SVMs , 2010, ICML.

[71]  Ming-Syan Chen,et al.  Efficient Kernel Approximation for Large-Scale Support Vector Machine Classification , 2011, SDM.

[72]  Ben Taskar,et al.  Max-Margin Markov Networks , 2003, NIPS.

[73]  Sanjay Ghemawat,et al.  MapReduce: Simplified Data Processing on Large Clusters , 2004, OSDI.

[74]  Chih-Jen Lin,et al.  Training and Testing Low-degree Polynomial Data Mappings via Linear SVM , 2010, J. Mach. Learn. Res..

[75]  Chih-Jen Lin,et al.  Asymptotic Behaviors of Support Vector Machines with Gaussian Kernel , 2003, Neural Computation.

[76]  Masashi Sugiyama,et al.  Super-Linear Convergence of Dual Augmented Lagrangian Algorithm for Sparse Learning , 2009 .

[77]  Vladimir Vapnik,et al.  Statistical learning theory , 1998 .

[78]  G. Wahba,et al.  Multicategory Support Vector Machines , Theory , and Application to the Classification of Microarray Data and Satellite Radiance Data , 2004 .

[79]  O. Mangasarian,et al.  Massive data discrimination via linear support vector machines , 2000 .

[80]  Jianfeng Gao,et al.  Scalable training of L1-regularized log-linear models , 2007, ICML '07.

[81]  David Madigan,et al.  Large-Scale Bayesian Logistic Regression for Text Categorization , 2007, Technometrics.

[82]  P. Tseng,et al.  On the convergence of the coordinate descent method for convex differentiable minimization , 1992 .

[83]  Chih-Jen Lin,et al.  A dual coordinate descent method for large-scale linear SVM , 2008, ICML '08.

[84]  Bernhard E. Boser,et al.  A training algorithm for optimal margin classifiers , 1992, COLT '92.

[85]  Ryan M. Rifkin,et al.  In Defense of One-Vs-All Classification , 2004, J. Mach. Learn. Res..

[86]  Kim-Chuan Toh,et al.  A coordinate gradient descent method for ℓ1-regularized convex minimization , 2011, Comput. Optim. Appl..

[87]  Yuh-Jye Lee,et al.  RSVM: Reduced Support Vector Machines , 2001, SDM.

[88]  Jorge Nocedal,et al.  On the limited memory BFGS method for large scale optimization , 1989, Math. Program..

[89]  Yann LeCun,et al.  Large Scale Online Learning , 2003, NIPS.

[90]  Tong Zhang,et al.  Solving large scale linear prediction problems using stochastic gradient descent algorithms , 2004, ICML.

[91]  Rob Malouf,et al.  A Comparison of Algorithms for Maximum Entropy Parameter Estimation , 2002, CoNLL.

[92]  J. Darroch,et al.  Generalized Iterative Scaling for Log-Linear Models , 1972 .

[93]  Yoram Singer,et al.  Reducing Multiclass to Binary: A Unifying Approach for Margin Classifiers , 2000, J. Mach. Learn. Res..

[94]  Alexander J. Smola,et al.  Parallelized Stochastic Gradient Descent , 2010, NIPS.

[95]  Deepayan Chakrabarti,et al.  Contextual advertising by combining relevance with click feedback , 2008, WWW.

[96]  Chih-Jen Lin,et al.  Trust region Newton methods for large-scale logistic regression , 2007, ICML '07.

[97]  Stephen P. Boyd,et al.  An Interior-Point Method for Large-Scale $\ell_1$-Regularized Least Squares , 2007, IEEE Journal of Selected Topics in Signal Processing.

[98]  Chih-Jen Lin,et al.  A formal analysis of stopping criteria of decomposition methods for support vector machines , 2002, IEEE Trans. Neural Networks.

[99]  J. E. Kelley,et al.  The Cutting-Plane Method for Solving Convex Programs , 1960 .

[100]  Sören Sonnenburg,et al.  Optimized cutting plane algorithm for support vector machines , 2008, ICML '08.

[101]  T. Minka A comparison of numerical optimizers for logistic regression , 2004 .

[102]  Wotao Yin,et al.  A Fast Hybrid Algorithm for Large-Scale l1-Regularized Logistic Regression , 2010, J. Mach. Learn. Res..

[103]  Chih-Jen Lin,et al.  A Comparison of Optimization Methods and Software for Large-scale L1-regularized Linear Classification , 2010, J. Mach. Learn. Res..

[104]  Yoram Singer,et al.  Adaptive Subgradient Methods for Online Learning and Stochastic Optimization , 2011, J. Mach. Learn. Res..

[105]  Vivek S. Borkar,et al.  Distributed Asynchronous Incremental Subgradient Methods , 2001 .

[106]  Georgios B. Giannakis,et al.  Consensus-Based Distributed Support Vector Machines , 2010, J. Mach. Learn. Res..

[107]  Patrick Gallinari,et al.  SGD-QN: Careful Quasi-Newton Stochastic Gradient Descent , 2009, J. Mach. Learn. Res..

[108]  John C. Duchi,et al.  Distributed delayed stochastic optimization , 2011, 2012 IEEE 51st IEEE Conference on Decision and Control (CDC).

[109]  J. S. Cramer The Origins of Logistic Regression , 2002 .

[110]  Zheng Chen,et al.  P-packSVM: Parallel Primal grAdient desCent Kernel SVM , 2009, 2009 Ninth IEEE International Conference on Data Mining.

[111]  Jason Weston,et al.  Fast Kernel Classifiers with Online and Active Learning , 2005, J. Mach. Learn. Res..

[112]  Adam L. Berger,et al.  A Maximum Entropy Approach to Natural Language Processing , 1996, CL.

[113]  Thomas Hofmann,et al.  Large Margin Methods for Structured and Interdependent Output Variables , 2005, J. Mach. Learn. Res..

[114]  Robert Tibshirani,et al.  1-norm Support Vector Machines , 2003, NIPS.

[115]  Mário A. T. Figueiredo,et al.  Gradient Projection for Sparse Reconstruction: Application to Compressed Sensing and Other Inverse Problems , 2007, IEEE Journal of Selected Topics in Signal Processing.

[116]  John Langford,et al.  Sparse Online Learning via Truncated Gradient , 2008, NIPS.

[117]  Gerhard Weikum,et al.  Fast logistic regression for text categorization with variable-length n-grams , 2008, KDD.

[118]  Chih-Jen Lin,et al.  Dual coordinate descent methods for logistic regression and maximum entropy models , 2011, Machine Learning.

[119]  Nello Cristianini,et al.  Large Margin DAGs for Multiclass Classification , 1999, NIPS.