Hessian Averaging in Stochastic Newton Methods Achieves Superlinear Convergence

We consider minimizing a smooth and strongly convex objective function using a stochastic Newton method. At each iteration, the algorithm is given oracle access to a stochastic estimate of the Hessian matrix. The oracle model includes popular algorithms such as Subsampled Newton and Newton Sketch, which can efficiently construct stochastic Hessian estimates for many tasks, e.g., training machine learning models. Despite using second-order information, these existing methods do not exhibit superlinear convergence unless the stochastic noise is gradually reduced to zero over the course of the iterations, which would lead to a computational blow-up in the per-iteration cost. We propose to address this limitation with Hessian averaging: instead of using the most recent Hessian estimate, our algorithm maintains an average of all past estimates. This reduces the stochastic noise while avoiding the computational blow-up. We show that this scheme exhibits local Q-superlinear convergence with a non-asymptotic rate of $$(\varUpsilon \sqrt{\log(t)/t}\,)^{t}$$, where $$\varUpsilon$$ is proportional to the level of stochastic noise in the Hessian oracle. A potential drawback of this (uniform averaging) approach is that the averaged estimates contain Hessian information from the global phase of the method, i.e., before the iterates converge to a local neighborhood. This leads to a distortion that may substantially delay the superlinear convergence until long after the local neighborhood is reached. To address this drawback, we study a number of weighted averaging schemes that assign larger weights to recent Hessians, so that the superlinear convergence arises sooner, albeit at a slightly slower rate. Remarkably, we show that there exists a universal weighted averaging scheme that transitions to local convergence at an optimal stage, and still exhibits a superlinear convergence rate nearly matching (up to a logarithmic factor) that of uniform Hessian averaging.
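
As an illustration of the averaging idea sketched above, the snippet below implements a stochastic Newton iteration that replaces the latest Hessian estimate with a running, optionally weighted, average. This is a minimal sketch under simplifying assumptions, not the paper's algorithm: the names `averaged_newton`, `grad`, `hess_oracle`, and `weight_fn` are illustrative, the weight sequences analyzed in the paper are not reproduced, and no globalization (line search or damping) is included.

```python
import numpy as np

def averaged_newton(grad, hess_oracle, x0, steps=50, weight_fn=None):
    """Stochastic Newton method with (weighted) Hessian averaging.

    grad(x)        : exact gradient of the objective at x
    hess_oracle(x) : stochastic (noisy) Hessian estimate at x
    weight_fn(t)   : weight assigned to the t-th Hessian estimate;
                     None gives uniform averaging of all past estimates.
    """
    x = np.asarray(x0, dtype=float)
    H_avg = None      # running (weighted) average of all past Hessian estimates
    w_sum = 0.0       # running sum of weights
    for t in range(1, steps + 1):
        H_t = hess_oracle(x)                     # new stochastic Hessian estimate
        w_t = 1.0 if weight_fn is None else float(weight_fn(t))
        w_sum += w_t
        if H_avg is None:
            H_avg = H_t.astype(float).copy()
        else:
            # Incremental weighted average: H_avg <- H_avg + (w_t / w_sum) * (H_t - H_avg)
            H_avg += (w_t / w_sum) * (H_t - H_avg)
        # Newton step using the averaged Hessian instead of the most recent estimate
        x = x - np.linalg.solve(H_avg, grad(x))
    return x

# Toy usage: a subsampled-Hessian oracle for least squares, f(x) = (1/2n) ||Ax - b||^2.
rng = np.random.default_rng(0)
A = rng.standard_normal((1000, 20))
b = rng.standard_normal(1000)
grad = lambda x: A.T @ (A @ x - b) / len(b)

def hess_oracle(x, m=50):
    idx = rng.integers(0, len(b), size=m)        # sample m rows with replacement
    return A[idx].T @ A[idx] / m                 # subsampled Hessian estimate

x_unif = averaged_newton(grad, hess_oracle, np.zeros(20))                          # uniform averaging
x_wgt = averaged_newton(grad, hess_oracle, np.zeros(20), weight_fn=lambda t: t)    # emphasize recent Hessians
```

With `weight_fn=None` the update reduces to uniform averaging of all past estimates; an increasing weight sequence such as `weight_fn = lambda t: t` emphasizes recent Hessians, in the spirit of the weighted schemes discussed above.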
