The Heavy-Tail Phenomenon in SGD

In recent years, various notions of capacity and complexity have been proposed for characterizing the generalization properties of stochastic gradient descent (SGD) in deep learning. Some of the popular notions that correlate well with the performance on unseen data are (i) the `flatness' of the local minimum found by SGD, which is related to the eigenvalues of the Hessian, (ii) the ratio of the stepsize $\eta$ to the batch size $b$, which essentially controls the magnitude of the stochastic gradient noise, and (iii) the `tail-index', which measures the heaviness of the tails of the eigenspectra of the network weights. In this paper, we argue that these three seemingly unrelated perspectives for generalization are deeply linked to each other. We claim that depending on the structure of the Hessian of the loss at the minimum, and the choices of the algorithm parameters $\eta$ and $b$, the SGD iterates will converge to a \emph{heavy-tailed} stationary distribution. We rigorously prove this claim in the setting of linear regression: we show that even in a simple quadratic optimization problem with independent and identically distributed Gaussian data, the iterates can be heavy-tailed with infinite variance. We further characterize the behavior of the tails with respect to algorithm parameters, the dimension, and the curvature. We then translate our results into insights about the behavior of SGD in deep learning. We finally support our theory with experiments conducted on both synthetic data and neural networks. To our knowledge, these results are the first of their kind to rigorously characterize the empirically observed heavy-tailed behavior of SGD.
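To make the linear-regression claim concrete, the sketch below (not the paper's code) simulates constant-stepsize SGD on a least-squares objective with fresh i.i.d. Gaussian data and applies a Hill estimator to the norms of the iterates. In the one-dimensional case the update reduces to the stochastic recursion $x_{k+1} = (1 - \eta a_k^2)\, x_k + \eta a_k y_k$, a Kesten-type recursion for which heavy-tailed stationary behavior is expected once $\eta/b$ is large enough. The function names and parameter values (`eta = 1.0`, `b = 1`, the Hill cutoff `k`) are illustrative assumptions, not the paper's experimental setup.

```python
import numpy as np

rng = np.random.default_rng(0)

def sgd_least_squares(d=1, eta=1.0, b=1, n_iters=200_000, rng=rng):
    """Constant-stepsize SGD on the quadratic loss 0.5 * E[(a^T x - y)^2],
    with a fresh i.i.d. Gaussian mini-batch (a_i, y_i) drawn at every step."""
    x = np.zeros(d)
    norms = np.empty(n_iters)
    for k in range(n_iters):
        A = rng.standard_normal((b, d))          # features a_i ~ N(0, I_d)
        y = rng.standard_normal(b)               # targets  y_i ~ N(0, 1)
        x = x - (eta / b) * (A.T @ (A @ x - y))  # mini-batch gradient step
        norms[k] = np.linalg.norm(x)
    return norms

def hill_tail_index(samples, k=2_000):
    """Hill estimator of the tail index alpha from the k largest samples."""
    s = np.sort(samples)
    return 1.0 / np.mean(np.log(s[-k:] / s[-k - 1]))

# Discard a burn-in period so the retained iterates are close to stationary.
norms = sgd_least_squares()[50_000:]
print(f"Hill tail-index estimate: {hill_tail_index(norms):.2f}")
```

With these defaults ($d = 1$, $\eta = 1$, $b = 1$), the second moment of the multiplicative factor $1 - \eta a^2$ exceeds one while its log-moment remains negative, so one expects a stationary but infinite-variance chain and a Hill estimate below 2; decreasing $\eta$ or increasing $b$ should push the estimate back above 2, consistent with the role of the ratio $\eta/b$ discussed above.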
