A Comparison of Optimization Methods for Large-scale L1-regularized Linear Classification

Large-scale linear classification is widely used in many areas. The L1-regularized form can be applied for feature selection, but its non-differentiability makes training more difficult. Although many optimization methods have been proposed in recent years, no serious comparison among them has been made. In this paper, we discuss several state-of-the-art methods and propose two new implementations. We then conduct a comprehensive comparison. Results show that decomposition methods, and coordinate descent methods in particular, are well suited to training on large document data.
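
The conclusion above rests on the fact that each coordinate update is very cheap. As a minimal sketch of the idea (not the paper's actual solver), the code below runs cyclic coordinate descent on the squared-loss (lasso) special case, where each one-variable subproblem has a closed-form soft-thresholding solution; the names `cd_lasso` and `soft_threshold` are ours, introduced for illustration.

```python
import numpy as np

def soft_threshold(z, t):
    # Closed-form minimizer of 0.5*(w - z)**2 + t*|w| over w.
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def cd_lasso(X, y, lam, n_iters=100):
    """Cyclic coordinate descent for min_w 0.5*||Xw - y||^2 + lam*||w||_1."""
    d = X.shape[1]
    w = np.zeros(d)
    r = y.astype(float)             # residual y - Xw (w starts at zero)
    col_sq = (X ** 2).sum(axis=0)   # per-feature squared column norms
    for _ in range(n_iters):
        for j in range(d):
            if col_sq[j] == 0.0:
                continue            # all-zero feature, nothing to update
            # Exact minimizer of the one-variable subproblem in w_j.
            rho = X[:, j] @ r + col_sq[j] * w[j]
            w_new = soft_threshold(rho, lam) / col_sq[j]
            if w_new != w[j]:
                r -= X[:, j] * (w_new - w[j])   # keep residual in sync
                w[j] = w_new
    return w
```

For the logistic and L2-loss SVM objectives studied in the paper, the one-variable subproblem has no closed form, so coordinate descent methods instead take an approximate step such as a one-dimensional Newton direction with a line search; an analogous trick of maintaining Xw incrementally, like the residual update above, is what keeps each coordinate step cheap there as well.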
