The Heavy-Tail Phenomenon in SGD

In recent years, various notions of capacity and complexity have been proposed for characterizing the generalization properties of stochastic gradient descent (SGD) in deep learning. Some of the popular notions that correlate well with the performance on unseen data are (i) the `flatness' of the local minimum found by SGD, which is related to the eigenvalues of the Hessian, (ii) the ratio of the stepsize $\eta$ to the batch size $b$, which essentially controls the magnitude of the stochastic gradient noise, and (iii) the `tail-index', which measures the heaviness of the tails of the eigenspectra of the network weights. In this paper, we argue that these three seemingly unrelated perspectives for generalization are deeply linked to each other. We claim that depending on the structure of the Hessian of the loss at the minimum, and the choices of the algorithm parameters $\eta$ and $b$, the SGD iterates will converge to a \emph{heavy-tailed} stationary distribution. We rigorously prove this claim in the setting of linear regression: we show that even in a simple quadratic optimization problem with independent and identically distributed Gaussian data, the iterates can be heavy-tailed with infinite variance. We further characterize the behavior of the tails with respect to algorithm parameters, the dimension, and the curvature. We then translate our results into insights about the behavior of SGD in deep learning. We finally support our theory with experiments conducted on both synthetic data and neural networks. To our knowledge, these results are the first of their kind to rigorously characterize the empirically observed heavy-tailed behavior of SGD.
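To make the linear-regression claim concrete, the sketch below (not the paper's code) simulates constant-stepsize SGD on a least-squares objective with fresh i.i.d. Gaussian data and applies a Hill estimator to the norms of the iterates. In the one-dimensional case the update reduces to the stochastic recursion $x_{k+1} = (1 - \eta a_k^2)\, x_k + \eta a_k y_k$, a Kesten-type recursion for which heavy-tailed stationary behavior is expected once $\eta/b$ is large enough. The function names and parameter values (`eta = 1.0`, `b = 1`, the Hill cutoff `k`) are illustrative assumptions, not the paper's experimental setup.

```python
import numpy as np

rng = np.random.default_rng(0)

def sgd_least_squares(d=1, eta=1.0, b=1, n_iters=200_000, rng=rng):
    """Constant-stepsize SGD on the quadratic loss 0.5 * E[(a^T x - y)^2],
    with a fresh i.i.d. Gaussian mini-batch (a_i, y_i) drawn at every step."""
    x = np.zeros(d)
    norms = np.empty(n_iters)
    for k in range(n_iters):
        A = rng.standard_normal((b, d))          # features a_i ~ N(0, I_d)
        y = rng.standard_normal(b)               # targets  y_i ~ N(0, 1)
        x = x - (eta / b) * (A.T @ (A @ x - y))  # mini-batch gradient step
        norms[k] = np.linalg.norm(x)
    return norms

def hill_tail_index(samples, k=2_000):
    """Hill estimator of the tail index alpha from the k largest samples."""
    s = np.sort(samples)
    return 1.0 / np.mean(np.log(s[-k:] / s[-k - 1]))

# Discard a burn-in period so the retained iterates are close to stationary.
norms = sgd_least_squares()[50_000:]
print(f"Hill tail-index estimate: {hill_tail_index(norms):.2f}")
```

With these defaults ($d = 1$, $\eta = 1$, $b = 1$), the second moment of the multiplicative factor $1 - \eta a^2$ exceeds one while its log-moment remains negative, so one expects a stationary but infinite-variance chain and a Hill estimate below 2; decreasing $\eta$ or increasing $b$ should push the estimate back above 2, consistent with the role of the ratio $\eta/b$ discussed above.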
