Sub-Sampled Newton Methods II: Local Convergence Rates

Many data-fitting applications require the solution of an optimization problem involving a sum of a large number of functions of a high-dimensional parameter. Here, we consider the problem of minimizing a sum of $n$ functions over a convex constraint set $\mathcal{X} \subseteq \mathbb{R}^{p}$, where both $n$ and $p$ are large. In such problems, sub-sampling as a way to reduce $n$ can offer substantial computational savings. Within the context of second-order methods, we first give quantitative local convergence results for variants of Newton's method in which the Hessian is uniformly sub-sampled. Using random matrix concentration inequalities, one can sub-sample in a way that preserves the curvature information. Under such a sub-sampling strategy, we establish locally Q-linear and Q-superlinear convergence rates. We also give additional convergence results for the case where the sub-sampled Hessian is regularized, either by modifying its spectrum or by a Levenberg-type regularization. Finally, in addition to Hessian sub-sampling, we consider sub-sampling the gradient as a way to further reduce the computational complexity per iteration. We use approximate matrix multiplication results from randomized numerical linear algebra (RandNLA) to obtain the proper sampling strategy, and we establish locally R-linear convergence rates. In this setting, we also show that a very aggressive increase in the sample size yields an R-superlinearly convergent algorithm. While the sample size depends on the condition number of the problem, our convergence rates are problem-independent, i.e., they do not depend on quantities related to the problem. Hence, our analysis here can be used to complement the results of our basic framework from the companion paper [58], by exploring algorithmic trade-offs that are important in practice.
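To make the two sampling schemes concrete, below is a minimal sketch in Python/NumPy. It is an illustration under simplifying assumptions, not the implementation analyzed in the paper: the per-function oracles `grad_i` and `hess_i`, the sample sizes `s_H` and `s_g`, and the spectral floor `lam_min` are hypothetical names; the sketch ignores the constraint set (i.e., it takes $\mathcal{X} = \mathbb{R}^p$); and the analysis prescribes how large `s_H` must be (it grows with the condition number) for the concentration guarantees to apply.

```python
# A minimal sketch of sub-sampled Newton updates for
#     min_x f(x) = (1/n) * sum_i f_i(x),
# assuming each f_i is twice continuously differentiable and f is
# strongly convex near the minimizer. The oracles grad_i(i, x) and
# hess_i(i, x), returning the gradient/Hessian of f_i, are hypothetical.
import numpy as np


def subsampled_newton_step(x, grad_i, hess_i, n, s_H, rng, lam_min=1e-6):
    """One iteration with exact gradient and uniformly sub-sampled Hessian."""
    # Exact full gradient: this variant sub-samples only the curvature.
    g = sum(grad_i(i, x) for i in range(n)) / n

    # Uniform sub-sampling (with replacement) of s_H per-function Hessians;
    # matrix concentration inequalities show that the average preserves the
    # true curvature with high probability once s_H is large enough.
    S = rng.integers(0, n, size=s_H)
    H = sum(hess_i(i, x) for i in S) / s_H

    # Spectral regularization: lift eigenvalues below lam_min so the
    # sub-sampled Hessian is safely invertible (one of the regularized
    # variants discussed above; a Levenberg-type variant would instead
    # add a multiple of the identity).
    w, V = np.linalg.eigh(H)
    H = (V * np.maximum(w, lam_min)) @ V.T

    # Unit-step Newton update, as in the local convergence analysis.
    return x - np.linalg.solve(H, g)


def subsampled_gradient(x, grad_i, n, s_g, rng):
    """Non-uniform gradient estimate in the spirit of RandNLA approximate
    matrix multiplication: sample index i with probability proportional to
    ||grad f_i(x)|| and reweight, giving an unbiased estimate of the full
    gradient. (Forming all n gradients here is purely illustrative; in
    practice one would use cheap norm bounds.)"""
    G = np.array([grad_i(i, x) for i in range(n)])  # shape (n, p)
    norms = np.linalg.norm(G, axis=1)
    probs = norms / norms.sum()
    S = rng.choice(n, size=s_g, p=probs)
    return sum(G[i] / (s_g * probs[i]) for i in S) / n
```

Replacing the exact gradient above with `subsampled_gradient` and growing `s_g` geometrically across iterations corresponds to the R-linear regime; a more aggressive growth schedule for `s_g` gives the R-superlinear rate mentioned above.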

[1]  Kenneth Levenberg. A Method for the Solution of Certain Non-Linear Problems in Least Squares, 1944.

[2]  D. Marquardt. An Algorithm for Least-Squares Estimation of Nonlinear Parameters, 1963.

[3]  J. Nocedal. Updating Quasi-Newton Matrices With Limited Storage, 1980.

[4]  P. McCullagh, et al. Generalized Linear Models, 1984.

[5]  Jorge Nocedal, et al. On the limited memory BFGS method for large scale optimization, 1989, Math. Program.

[6]  J. Shewchuk. An Introduction to the Conjugate Gradient Method Without the Agonizing Pain, 1994.

[7]  Dimitri P. Bertsekas, et al. Nonlinear Programming, 1997.

[8]  John N. Tsitsiklis, et al. Neuro-Dynamic Programming, 1996, Encyclopedia of Machine Learning.

[9]  Christopher J. C. Burges. A Tutorial on Support Vector Machines for Pattern Recognition, 1998.

[10]  Léon Bottou. Online Learning and Stochastic Approximations, 1998.

[11]  Léon Bottou, et al. On-line learning and stochastic approximations, 1999.

[12]  Jorge Nocedal, et al. Numerical Optimization, 2006, Springer.

[13]  E. Haber, et al. On optimization techniques for solving nonlinear inverse problems, 2000.

[14]  Alexander J. Smola, et al. Learning with Kernels: support vector machines, regularization, optimization, and beyond, 2001, Adaptive Computation and Machine Learning series.

[15]  Olvi L. Mangasarian. A finite Newton method for classification, 2002, Optim. Methods Softw.

[16]  Yann LeCun, et al. Large Scale Online Learning, 2003, NIPS.

[17]  Yurii Nesterov. Introductory Lectures on Convex Optimization: A Basic Course, 2004, Applied Optimization.

[18]  Christopher J. C. Burges. A Tutorial on Support Vector Machines for Pattern Recognition, 1998, Data Mining and Knowledge Discovery.

[19]  S. Sathiya Keerthi, et al. A Modified Finite Newton Method for Fast Solution of Large Scale Linear SVMs, 2005, J. Mach. Learn. Res.

[20]  Rong Jin, et al. Distance Metric Learning: A Comprehensive Survey, 2006.

[21]  Stephen P. Boyd, et al. Convex Optimization, 2004, Cambridge University Press.

[22]  Petros Drineas, et al. Fast Monte Carlo Algorithms for Matrices I: Approximating Matrix Multiplication, 2006, SIAM J. Comput.

[23]  H. Robbins. A Stochastic Approximation Method, 1951.

[24]  Olivier Chapelle. Training a Support Vector Machine in the Primal, 2007, Neural Computation.

[25]  Chih-Jen Lin, et al. Trust Region Newton Method for Logistic Regression, 2008, J. Mach. Learn. Res.

[26]  Alexander Shapiro, et al. Robust Stochastic Approximation Approach to Stochastic Programming, 2009, SIAM J. Optim.

[27]  D. Bernstein. Matrix Mathematics: Theory, Facts, and Formulas, 2009.

[28]  James Martens. Deep learning via Hessian-free optimization, 2010, ICML.

[29]  S. V. N. Vishwanathan, et al. A Quasi-Newton Approach to Nonsmooth Convex Optimization Problems in Machine Learning, 2008, J. Mach. Learn. Res.

[30]  Vincent Nesme, et al. Note on sampling without replacing from a finite collection of matrices, 2010, arXiv.

[31]  Léon Bottou. Large-Scale Machine Learning with Stochastic Gradient Descent, 2010, COMPSTAT.

[32]  U. Ascher, et al. Adaptive and stochastic algorithms for EIT and DC resistivity problems with piecewise constant solutions and many measurements, 2011.

[33]  Jorge Nocedal, et al. On the Use of Stochastic Hessian Information in Optimization Methods for Machine Learning, 2011, SIAM J. Optim.

[34]  Michael W. Mahoney. Randomized Algorithms for Matrices and Data, 2011, Found. Trends Mach. Learn.

[35]  Jacek Gondzio, et al. Exploiting separability in large-scale linear support vector machine training, 2011, Comput. Optim. Appl.

[36]  Nathan Halko, et al. Finding Structure with Randomness: Probabilistic Algorithms for Constructing Approximate Matrix Decompositions, 2009, SIAM Rev.

[37]  Ohad Shamir, et al. Better Mini-Batch Algorithms via Accelerated Gradient Methods, 2011, NIPS.

[38]  Mark W. Schmidt, et al. A Stochastic Gradient Method with an Exponential Convergence Rate for Finite Training Sets, 2012, NIPS.

[39]  Eldad Haber, et al. An Effective Method for Parameter Estimation with PDE Constraints with Multiple Right-Hand Sides, 2012, SIAM J. Optim.

[40]  Felix J. Herrmann, et al. Robust inversion, dimensionality reduction, and randomized sampling, 2012, Math. Program.

[41]  Mark W. Schmidt, et al. Hybrid Deterministic-Stochastic Methods for Data Fitting, 2011, SIAM J. Sci. Comput.

[42]  Jorge Nocedal, et al. Sample size selection in optimization methods for machine learning, 2012, Math. Program.

[43]  Shai Shalev-Shwartz, et al. Stochastic dual coordinate ascent methods for regularized loss, 2012, J. Mach. Learn. Res.

[44]  Georg Heigold, et al. An empirical study of learning rates in deep neural networks for speech recognition, 2013, IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[45]  Tong Zhang, et al. Accelerating Stochastic Gradient Descent using Predictive Variance Reduction, 2013, NIPS.

[46]  Martin Jaggi. Revisiting Frank-Wolfe: Projection-Free Sparse Convex Optimization, 2013, ICML.

[47]  E. Haber, et al. The lost honor of ℓ2-based regularization, 2013.

[48]  Uri M. Ascher, et al. Data completion and stochastic algorithms for PDE inversion problems with many measurements, 2013, arXiv.

[49]  Michael I. Jordan, et al. Matrix concentration inequalities via the method of exchangeable pairs, 2012, arXiv:1201.6002.

[50]  M. Girolami, et al. Solving large-scale PDE-constrained Bayesian inverse problems with Riemann manifold Hamiltonian Monte Carlo, 2014, arXiv:1407.1517.

[51]  Uri M. Ascher, et al. Stochastic Algorithms for Inverse Problems Involving PDEs and many Measurements, 2014, SIAM J. Sci. Comput.

[52]  Eldad Haber, et al. Simultaneous Source for non-uniform data variance and missing data, 2014, arXiv.

[53]  Alexander J. Smola, et al. Efficient mini-batch training for stochastic optimization, 2014, KDD.

[54]  Uri M. Ascher, et al. Assessing stochastic algorithms for large scale nonlinear least squares problems using extremal probabilities of linear combinations of gamma random variables, 2014, SIAM/ASA J. Uncertain. Quantification.

[55]  Joel A. Tropp. An Introduction to Matrix Concentration Inequalities, 2015, Found. Trends Mach. Learn.

[56]  Andrea Montanari, et al. Convergence rates of sub-sampled Newton methods, 2015, NIPS.

[57]  Ohad Shamir, et al. Fast Stochastic Algorithms for SVD and PCA: Convergence Properties and Convexity, 2015, ICML.

[58]  Michael W. Mahoney, et al. Sub-Sampled Newton Methods I: Globally Convergent Algorithms, 2016, arXiv.

[59]  Mark W. Schmidt, et al. Minimizing finite sums with the stochastic average gradient, 2013, Mathematical Programming.

[60]  Martin J. Wainwright, et al. Newton Sketch: A Near Linear-Time Optimization Algorithm with Linear-Quadratic Convergence, 2015, SIAM J. Optim.