A Quantitative Analysis of the Effect of Batch Normalization on Gradient Descent

Despite its empirical success and recent theoretical progress, a quantitative analysis of the effect of batch normalization (BN) on the convergence and stability of gradient descent is still largely lacking. In this paper, we provide such an analysis for the simple problem of ordinary least squares (OLS). Since the precise dynamical properties of gradient descent (GD) are completely known for the OLS problem, this allows us to isolate and compare the additional effects of BN. More precisely, we show that unlike GD, gradient descent with BN (BNGD) converges for arbitrary learning rates for the weights, and the convergence remains linear under mild conditions. Moreover, we quantify two distinct sources of acceleration of BNGD over GD -- one due to over-parameterization, which improves the effective condition number, and another due to the large range of learning rates that gives rise to fast descent. These phenomena set BNGD apart from GD and could account for much of its observed robustness. These findings are confirmed quantitatively by numerical experiments, which further show that many of the properties of BNGD uncovered for OLS are also observed qualitatively in more complex supervised learning problems.
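
To make the setting concrete, below is a minimal sketch (an assumed setup, not the authors' code) comparing GD with BNGD on a synthetic OLS instance. Here the target satisfies y = x^T u_star, H denotes the sample covariance of x, and BN is modeled by the reparameterization u = a * w / sqrt(w^T H w); the names u_star, H, lr_a, lr_w, and all step-size values are illustrative assumptions.

```python
import numpy as np

# Illustrative sketch (assumed setup, not the paper's code): compare plain
# gradient descent (GD) with gradient descent on a batch-normalized
# reparameterization (BNGD) for ordinary least squares.  Data satisfy
# y = x^T u_star, H is the sample covariance of x, and BN is modeled by
# rescaling the linear output by its standard deviation sqrt(w^T H w).

rng = np.random.default_rng(0)
d, n = 10, 5000
X = rng.standard_normal((n, d)) * np.linspace(1.0, 10.0, d)  # ill-conditioned features
u_star = rng.standard_normal(d)                               # ground-truth weights
H = X.T @ X / n                                               # sample covariance E[x x^T]

def ols_loss(u):
    """OLS loss 0.5 * (u - u_star)^T H (u - u_star)."""
    return 0.5 * (u - u_star) @ H @ (u - u_star)

def gd(steps=2000, lr=1e-3):
    """Plain GD on the OLS loss; lr must stay below 2 / lambda_max(H)."""
    u = np.zeros(d)
    for _ in range(steps):
        u -= lr * H @ (u - u_star)
    return ols_loss(u)

def bngd(steps=2000, lr_a=0.1, lr_w=10.0):
    """GD on the BN reparameterization u = a * w / sqrt(w^T H w)."""
    a, w = 0.0, rng.standard_normal(d)
    for _ in range(steps):
        sigma = np.sqrt(w @ H @ w)          # std of the linear output w^T x
        u = a * w / sigma
        g_u = H @ (u - u_star)              # dL/du
        g_a = g_u @ w / sigma               # chain rule through u = a w / sigma
        g_w = (a / sigma) * (g_u - (g_u @ w) / sigma**2 * (H @ w))
        a -= lr_a * g_a
        w -= lr_w * g_w                     # large lr_w: the iteration stays stable
    return ols_loss(a * w / np.sqrt(w @ H @ w))

print("GD   loss:", gd())
print("BNGD loss:", bngd())
```

In this parameterization the gradient with respect to w is orthogonal to w, so the norm of w can only grow and the effective step size on the w-direction self-tunes, which is why the large lr_w in the sketch does not destabilize the iteration.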
