Weighted SGD for ℓp Regression with Randomized Preconditioning

In recent years, stochastic gradient descent (SGD) methods and randomized linear algebra (RLA) algorithms have been applied to many large-scale problems in machine learning and data analysis. We aim to bridge the gap between these two classes of methods for solving constrained overdetermined linear regression problems, e.g., $\ell_2$ and $\ell_1$ regression. We propose a hybrid algorithm named pwSGD that uses RLA techniques for preconditioning and for constructing an importance sampling distribution, and then performs an SGD-like iterative process with weighted sampling on the preconditioned system. We prove that pwSGD inherits faster convergence rates that depend only on the lower dimension of the linear system, while maintaining low computational complexity. In particular, when solving an $\ell_1$ regression problem of size $n$ by $d$, pwSGD returns an approximate solution with $\epsilon$ relative error in the objective value in $\mathcal{O}(\log n \cdot \text{nnz}(A) + \text{poly}(d)/\epsilon^2)$ time. This complexity is uniformly better than that of RLA methods in terms of both $\epsilon$ and $d$ when the problem is unconstrained. For $\ell_2$ regression, pwSGD returns an approximate solution with $\epsilon$ relative error in the objective value and in the solution vector measured in prediction norm in $\mathcal{O}(\log n \cdot \text{nnz}(A) + \text{poly}(d) \log(1/\epsilon) /\epsilon)$ time. We also provide lower bounds on the coreset complexity for more general regression problems, indicating that new ideas will be needed to extend similar RLA preconditioning techniques to weighted SGD algorithms for more general regression problems. Finally, the effectiveness of the proposed algorithm is illustrated numerically on both synthetic and real datasets.
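The following is a minimal, illustrative Python sketch of the pwSGD approach for unconstrained $\ell_2$ regression: a randomized sketch is used to build a preconditioner and an approximate leverage-score sampling distribution, and weighted SGD is then run on the preconditioned system. The sketch size, step-size schedule, and iteration count are illustrative assumptions, not the tuned choices analyzed in the paper.

```python
# A minimal, illustrative sketch of a pwSGD-style solver for unconstrained
# l2 regression, min_x ||Ax - b||_2.  The sketch size, step-size schedule,
# and iteration count are illustrative assumptions, not tuned values.
import numpy as np

def pwsgd_l2(A, b, n_iters=5000, sketch_rows=None, rng=None):
    rng = np.random.default_rng(rng)
    n, d = A.shape
    r = sketch_rows or 8 * d                      # oversampled sketch size (assumption)

    # 1) Randomized preconditioning: sketch A, then take a QR factorization
    #    of the sketch.  A dense Gaussian sketch is used for clarity; the
    #    nnz(A)-time guarantees require a sparse embedding or fast transform.
    S = rng.standard_normal((r, n)) / np.sqrt(r)
    _, R = np.linalg.qr(S @ A)                    # S A = Q R, so R^{-1} is the preconditioner

    # 2) Importance sampling distribution from approximate leverage scores:
    #    p_i proportional to ||e_i^T A R^{-1}||_2^2.
    U = np.linalg.solve(R.T, A.T).T               # U = A R^{-1}
    p = np.sum(U ** 2, axis=1)
    p /= p.sum()

    # 3) Weighted SGD on the preconditioned system, iterating in y = R x.
    #    Each sampled component has smoothness constant 2 * ||U||_F^2, which
    #    motivates the step-size scale below.
    lip = 2.0 * np.sum(U ** 2)
    y = np.zeros(d)
    rows = rng.choice(n, size=n_iters, p=p)
    for t, i in enumerate(rows, start=1):
        resid = U[i] @ y - b[i]
        grad = (2.0 * resid / p[i]) * U[i]        # unbiased estimate of grad of ||Uy - b||^2
        y -= grad / (lip * np.sqrt(t))            # decaying step size (assumption)

    return np.linalg.solve(R, y)                  # map back: x = R^{-1} y

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    A = rng.standard_normal((10000, 20))
    b = A @ rng.standard_normal(20) + 0.01 * rng.standard_normal(10000)
    x_hat = pwsgd_l2(A, b)
    print("relative residual:", np.linalg.norm(A @ x_hat - b) / np.linalg.norm(b))
```

Iterating in the preconditioned variable $y = Rx$ makes the effective problem well conditioned, which is what lets the number of SGD iterations depend on $d$ and $\epsilon$ rather than on the conditioning of $A$; for constrained problems, a projection onto the constraint set would be inserted after each update.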
