A general framework for fast stagewise algorithms

Forward stagewise regression follows a very simple strategy for constructing a sequence of sparse regression estimates: it starts with all coefficients equal to zero, and iteratively updates the coefficient (by a small amount ε) of the variable that achieves the maximal absolute inner product with the current residual. This procedure has an interesting connection to the lasso: under some conditions, it is known that the sequence of forward stagewise estimates exactly coincides with the lasso path, as the step size ε goes to zero. Furthermore, essentially the same equivalence holds outside of least squares regression, for the minimization of a differentiable convex loss function subject to an ℓ1 norm constraint (the stagewise algorithm now updates the coefficient corresponding to the maximal absolute component of the gradient). Even when they do not match their ℓ1-constrained analogues, stagewise estimates provide a useful approximation, and are computationally appealing. Their success in sparse modeling motivates the question: can a simple, effective strategy like forward stagewise be applied more broadly in other regularization settings, beyond the ℓ1 norm and sparsity? The current paper is an attempt to do just this. We present a general framework for stagewise estimation, which yields fast algorithms for problems such as group-structured learning, matrix completion, image denoising, and more.
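
To make the procedure concrete, the following is a minimal Python sketch of forward stagewise for least squares, written directly from the description above rather than from the paper's own implementation; the function name forward_stagewise and the parameters step (playing the role of ε) and n_steps are illustrative labels, not part of the source.

    import numpy as np

    def forward_stagewise(X, y, step=0.01, n_steps=2000):
        # Forward stagewise regression for least squares (illustrative sketch):
        # start at beta = 0 and repeatedly nudge the coefficient of the
        # variable most correlated with the current residual by `step`.
        n, p = X.shape
        beta = np.zeros(p)
        resid = y.astype(float).copy()
        path = np.zeros((n_steps, p))      # coefficient path, one row per step
        for t in range(n_steps):
            corr = X.T @ resid             # inner products with the residual
            j = np.argmax(np.abs(corr))    # variable with maximal |<x_j, r>|
            delta = step * np.sign(corr[j])  # small signed update of size epsilon
            beta[j] += delta
            resid -= delta * X[:, j]       # update the residual incrementally
            path[t] = beta
        return path

As the step size shrinks toward zero (with a correspondingly larger number of iterations), the coefficient path traced by such a procedure approximates the lasso solution path under the conditions mentioned above.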
