On Minibatch Noise: Discrete-Time SGD, Overparametrization, and Bayes

The noise in stochastic gradient descent (SGD), caused by minibatch sampling, remains poorly understood despite its enormous practical importance for training efficiency and generalization. In this work, we study minibatch noise in SGD. Motivated by the observation that minibatch sampling does not always cause fluctuations, we set out to identify the conditions under which minibatch noise emerges. We first derive analytically solvable results for linear regression under various settings and compare them to the approximations commonly used to understand SGD noise. We show that some degree of mismatch between model and data complexity is needed for SGD to produce noise, and that such a mismatch may arise from static noise in the labels or the input, from the use of regularization, or from underparametrization. Our results motivate a more accurate general formulation of minibatch noise.
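As a purely illustrative aid (not the paper's setup or code), the following minimal sketch contrasts the two regimes described above for linear regression: realizable data, where SGD settles at a point where every per-sample gradient vanishes and minibatch sampling injects essentially no noise, versus label noise in an underparametrized model, where residuals cannot all be zero and the iterate keeps fluctuating. All concrete choices (dimensions, learning rate, batch size, noise level) are assumptions made for the sketch.

```python
# A minimal sketch (not the paper's code) of when minibatch sampling does or
# does not inject noise into SGD on linear regression. All settings below
# (n, d, learning rate, batch size, noise level) are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)
n, d = 100, 20                      # underparametrized: more samples than parameters
X = rng.standard_normal((n, d))
w_true = rng.standard_normal(d)

def stationary_fluctuation(label_noise_std, lr=0.05, batch=10,
                           steps=20000, burn_in=10000):
    """Run minibatch SGD on the loss 0.5*||X w - y||^2 / n and return the
    average per-coordinate variance of the iterate after burn-in."""
    y = X @ w_true + label_noise_std * rng.standard_normal(n)
    w = np.zeros(d)
    tail = []
    for t in range(steps):
        idx = rng.choice(n, size=batch, replace=False)   # sample a minibatch
        grad = X[idx].T @ (X[idx] @ w - y[idx]) / batch  # minibatch gradient
        w -= lr * grad
        if t >= burn_in:
            tail.append(w.copy())
    return float(np.mean(np.var(np.asarray(tail), axis=0)))

# Realizable data (no label noise): SGD converges to the exact solution,
# where every per-sample gradient vanishes, so the iterate stops fluctuating.
print("no label noise :", stationary_fluctuation(label_noise_std=0.0))

# Label noise in an underparametrized model leaves nonzero residuals at the
# minimizer, so minibatch gradients disagree and the iterate keeps jittering.
print("label noise 0.5:", stationary_fluctuation(label_noise_std=0.5))
```

Under these assumptions, the first case reports a stationary variance at numerical-precision level, while the second reports a strictly positive variance set by the label-noise scale, the learning rate, and the batch size, consistent with the mismatch condition stated in the abstract.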
