Weighted SGD for ℓp Regression with Randomized Preconditioning

In recent years, stochastic gradient descent (SGD) methods and randomized linear algebra (RLA) algorithms have been applied to many large-scale problems in machine learning and data analysis. SGD methods are easy to implement and applicable to a wide range of convex optimization problems. In contrast, RLA algorithms provide much stronger performance guarantees but are applicable to a narrower class of problems. We aim to bridge the gap between these two methods for solving constrained overdetermined linear regression problems, e.g., ℓ2 and ℓ1 regression problems. We propose a hybrid algorithm named pwSGD that uses RLA techniques for preconditioning and for constructing an importance sampling distribution, and then performs an SGD-like iterative process with weighted sampling on the preconditioned system. By rewriting a deterministic ℓp regression problem as a stochastic optimization problem, we connect pwSGD to several existing ℓp solvers, including RLA methods with algorithmic leveraging. We prove that pwSGD inherits faster convergence rates that depend only on the lower dimension of the linear system, while maintaining low computational complexity. These convergence rates are superior to those of related SGD algorithms such as the weighted randomized Kaczmarz algorithm. In particular, when solving an ℓ1 regression problem of size n × d, pwSGD returns an approximate solution with ε relative error in the objective value in 𝒪(log n·nnz(A) + poly(d)/ε²) time. This complexity is uniformly better than that of RLA methods in terms of both ε and d when the problem is unconstrained. In the presence of constraints, pwSGD only has to solve a sequence of much simpler and smaller optimization problems over the same constraints, which in general is more efficient than solving the constrained subproblem required by RLA. For ℓ2 regression, pwSGD returns an approximate solution with ε relative error in the objective value and in the solution vector measured in prediction norm in 𝒪(log n·nnz(A) + poly(d) log(1/ε)/ε) time. We show that for unconstrained ℓ2 regression, this complexity is comparable to that of RLA methods and is asymptotically better than several state-of-the-art solvers in the regime where the desired accuracy ε, high dimension n, and low dimension d satisfy d ≥ 1/ε and n ≥ d²/ε. We also provide lower bounds on the coreset complexity for more general regression problems, indicating that new ideas will be needed to extend similar RLA preconditioning ideas to weighted SGD algorithms for more general regression problems. Finally, the effectiveness of such algorithms is illustrated numerically on both synthetic and real datasets; the results are consistent with our theoretical findings and demonstrate that pwSGD converges more quickly to a medium-precision solution, e.g., ε = 10⁻³.
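To make the pipeline concrete, below is a minimal Python sketch of a pwSGD-style solver for unconstrained ℓ2 regression. It is an illustration under several assumptions rather than the paper's implementation: a dense Gaussian sketch is used for preconditioning, leverage scores of the preconditioned matrix are computed exactly, the iteration is warm-started from the sketched least-squares solution, and the step-size schedule and tail averaging are heuristic choices. All function and parameter names (e.g., pwsgd_l2, sketch_size) are hypothetical.

```python
# Illustrative pwSGD-style solver for unconstrained least squares (ell_2 regression):
# randomized preconditioning + leverage-score importance sampling + weighted SGD.
import numpy as np

def pwsgd_l2(A, b, num_iters=20000, sketch_size=None, seed=0):
    """Approximately solve min_x ||A x - b||_2 via preconditioned weighted SGD."""
    rng = np.random.default_rng(seed)
    n, d = A.shape
    s = sketch_size or 8 * d                        # modest oversampling for the sketch

    # 1) Randomized preconditioning: sketch A, then QR on the small s x d matrix S A.
    S = rng.standard_normal((s, n)) / np.sqrt(s)    # dense Gaussian sketch (illustrative choice)
    Q, R = np.linalg.qr(S @ A)                      # A R^{-1} is well conditioned w.h.p.
    U = np.linalg.solve(R.T, A.T).T                 # preconditioned rows: u_i^T = a_i^T R^{-1}

    # 2) Importance sampling distribution from leverage scores of the preconditioned system.
    row_norms = np.einsum("ij,ij->i", U, U)         # squared row norms ||u_i||_2^2
    p = row_norms / row_norms.sum()

    # 3) Weighted SGD in the coordinates y = R x, warm-started from the sketched solution.
    y = Q.T @ (S @ b)                               # solves min_y ||S A R^{-1} y - S b||_2
    y_avg = np.zeros(d)
    burn_in = num_iters // 2
    for t, i in enumerate(rng.choice(n, size=num_iters, p=p)):
        g = 2.0 * U[i] * (U[i] @ y - b[i]) / p[i]   # unbiased estimate of grad ||U y - b||_2^2
        y -= g / (2.0 * (t + 100))                  # decaying step size (heuristic schedule)
        if t >= burn_in:
            y_avg += y / (num_iters - burn_in)      # tail averaging to reduce variance
    return np.linalg.solve(R, y_avg)                # map back: x = R^{-1} y

# Usage: compare against the exact least-squares solution on a random instance.
if __name__ == "__main__":
    rng = np.random.default_rng(1)
    A = rng.standard_normal((10000, 20))
    b = A @ rng.standard_normal(20) + 0.1 * rng.standard_normal(10000)
    x_exact = np.linalg.lstsq(A, b, rcond=None)[0]
    x_pwsgd = pwsgd_l2(A, b)
    print(np.linalg.norm(A @ x_pwsgd - b) / np.linalg.norm(A @ x_exact - b))
```

The key design point the sketch tries to convey is that all expensive work (the sketch, the QR factorization, and the leverage-score computation) depends on the low dimension d, while each SGD step touches only a single row of the preconditioned system.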
