Newton-LESS: Sparsification without Trade-offs for the Sketched Newton Update

In second-order optimization, a potential bottleneck is computing the Hessian matrix of the objective function at every iteration. Randomized sketching has emerged as a powerful technique for constructing estimates of the Hessian which can be used to perform approximate Newton steps. This involves multiplication by a random sketching matrix, which introduces a trade-off between the computational cost of sketching and the convergence rate of the optimization algorithm. A theoretically desirable but practically much too expensive choice is to use a dense Gaussian sketching matrix, which produces unbiased estimates of the exact Newton step and which offers strong problem-independent convergence guarantees. We show that the Gaussian sketching matrix can be drastically sparsified, significantly reducing the computational cost of sketching, without substantially affecting its convergence properties. This approach, called Newton-LESS, is based on a recently introduced sketching technique: LEverage Score Sparsified (LESS) embeddings. We prove that Newton-LESS enjoys nearly the same problem-independent local convergence rate as Gaussian embeddings, not just up to constant factors but even down to lower-order terms, for a large class of optimization tasks. In particular, this leads to a new state-of-the-art convergence result for an iterative least squares solver. Finally, we extend LESS embeddings to include uniformly sparsified random sign matrices which can be implemented efficiently and which perform well in numerical experiments.
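
To make the kind of update described above concrete, below is a minimal NumPy sketch of one approximate Newton step for a least-squares objective, using a uniformly sparsified random sign sketching matrix in place of a dense Gaussian one. The function names, the sketch size m, the per-row sparsity s, the unit step size, and the dense construction of the sketching matrix are illustrative assumptions chosen for readability; they are not the paper's implementation, whose LESS construction relies on leverage-score-based sparsification and sparse arithmetic to realize the claimed cost savings.

import numpy as np

def sparse_sign_sketch(m, n, s, rng):
    # Each of the m rows has s nonzeros in uniformly random columns, with
    # values +/- sqrt(n / (s * m)) so that E[S^T S] = I_n, matching the
    # normalization of a dense Gaussian sketch with N(0, 1/m) entries.
    # Built densely here for readability; a real implementation would use
    # a sparse matrix format.
    S = np.zeros((m, n))
    scale = np.sqrt(n / (s * m))
    for i in range(m):
        cols = rng.choice(n, size=s, replace=False)
        S[i, cols] = scale * rng.choice([-1.0, 1.0], size=s)
    return S

def sketched_newton_step(A, grad, x, m, s, rng, step_size=1.0):
    # One approximate Newton step for an objective whose Hessian is A^T A
    # (here: least squares). The exact Hessian A^T A is replaced by the
    # sketched estimate (S A)^T (S A), which is cheaper to form when S is
    # sparse. step_size=1.0 is a simplification; the paper analyzes a
    # tuned step size.
    S = sparse_sign_sketch(m, A.shape[0], s, rng)
    SA = S @ A
    H_hat = SA.T @ SA
    return x - step_size * np.linalg.solve(H_hat, grad)

# Toy usage: minimize f(x) = 0.5 * ||A x - b||^2 with gradient A^T (A x - b).
rng = np.random.default_rng(0)
n, d = 2000, 50
A = rng.standard_normal((n, d))
b = rng.standard_normal(n)
x = np.zeros(d)
for _ in range(10):
    grad = A.T @ (A @ x - b)
    x = sketched_newton_step(A, grad, x, m=4 * d, s=8, rng=rng)
print(np.linalg.norm(A.T @ (A @ x - b)))  # gradient norm shrinks across iterations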
