Halting Time is Predictable for Large Models: A Universality Property and Average-Case Analysis

Average-case analysis computes the complexity of an algorithm averaged over all possible inputs. Compared to worst-case analysis, it is more representative of an algorithm's typical behavior, but it remains largely unexplored in optimization. One difficulty is that the analysis can depend on the probability distribution of the inputs to the model. However, we show that this is not the case for a class of large-scale problems trained with gradient descent, including random least squares and one-hidden-layer neural networks with random weights. In fact, the halting time exhibits a universality property: it is independent of the probability distribution. With this barrier for average-case analysis removed, we provide the first explicit average-case convergence rates, showing a tighter complexity not captured by traditional worst-case analysis. Finally, numerical simulations suggest that this universality property holds for a more general class of algorithms and problems.
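To make the universality claim concrete, below is a minimal sketch (not the paper's actual experimental setup) that runs gradient descent on random least-squares problems whose design matrices are drawn from two different entry distributions and records the halting time, i.e. the number of iterations until the gradient norm drops below a tolerance. The dimensions, step size, tolerance, and names such as halting_time and sampler are illustrative choices, not quantities taken from the paper.

```python
# Minimal sketch, not the paper's experiments: probe halting-time universality
# by timing gradient descent on random least-squares problems
#   min_x 0.5 * ||A x - b||^2
# with A drawn from two different entry distributions (Gaussian vs. Rademacher).
import numpy as np

def halting_time(A, b, tol=1e-6, max_iter=100_000):
    """Number of gradient-descent steps until ||A^T (A x - b)|| < tol."""
    n, d = A.shape
    x = np.zeros(d)
    # Step size 1/L, where L = ||A||_2^2 is the Lipschitz constant of the gradient.
    L = np.linalg.norm(A, 2) ** 2
    for k in range(1, max_iter + 1):
        grad = A.T @ (A @ x - b)
        if np.linalg.norm(grad) < tol:
            return k
        x -= grad / L
    return max_iter

rng = np.random.default_rng(0)
n, d, trials = 800, 400, 20  # illustrative sizes with a fixed aspect ratio d/n
samplers = {
    "gaussian":   lambda: rng.standard_normal((n, d)) / np.sqrt(n),
    "rademacher": lambda: rng.choice([-1.0, 1.0], size=(n, d)) / np.sqrt(n),
}
for name, sampler in samplers.items():
    times = []
    for _ in range(trials):
        A = sampler()
        b = A @ rng.standard_normal(d)   # planted signal, so the system is consistent
        times.append(halting_time(A, b))
    print(f"{name:>10}: mean halting time {np.mean(times):.1f} +/- {np.std(times):.1f}")
```

Under the universality property described above, the two empirical halting-time distributions should concentrate around the same deterministic value as n and d grow with a fixed ratio, even though the input distributions differ.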
