Batch Normalization is a commonly used trick to improve the training of deep neural networks. Such networks are typically also trained with L2 regularization, also called weight decay, ostensibly to prevent overfitting. However, we show that L2 regularization has no regularizing effect when combined with normalization. Instead, regularization influences the scale of the weights, and thereby the effective learning rate. We investigate this dependence both theoretically and experimentally. We show that popular optimization methods such as ADAM only partially eliminate the influence of normalization on the effective learning rate. This leads to a discussion of other ways to mitigate this issue.
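The mechanism behind these claims is that a batch-normalized layer is invariant to the scale of its incoming weights: L2 regularization therefore does not constrain the function the network computes, it only shrinks the weights, and since the gradient scales inversely with the weight norm, smaller weights mean larger effective steps. The sketch below illustrates this numerically; it is not code from the paper, and the single linear-plus-BN layer, random data, and toy loss are illustrative assumptions.

```python
# Minimal numerical sketch of the scale-invariance argument. Assumptions not in
# the abstract: one linear layer followed by batch normalization without learned
# affine parameters, random inputs, and a random linear readout as a toy loss.
import torch

torch.manual_seed(0)

def bn_linear(x, W, eps=1e-5):
    """Linear layer followed by per-feature batch normalization."""
    z = x @ W.T                                   # pre-activations, shape (batch, out)
    return (z - z.mean(0)) / (z.std(0, unbiased=False) + eps)

x = torch.randn(128, 10)                          # a batch of inputs
t = torch.randn(128, 4)                           # fixed random readout for a toy loss
W = torch.randn(4, 10, requires_grad=True)        # layer weights
alpha = 10.0                                      # arbitrary positive rescaling of W
W_scaled = (alpha * W.detach()).requires_grad_()

# 1) The normalized output is invariant to the scale of W ...
y1, y2 = bn_linear(x, W), bn_linear(x, W_scaled)
print(torch.allclose(y1, y2, atol=1e-3))          # True: BN((alpha*W) x) == BN(W x)

# 2) ... but the gradient shrinks by 1/alpha. With a fixed learning rate, a
#    gradient step therefore moves small weights relatively further: shrinking
#    the weights (as L2 regularization does) raises the effective learning rate.
(y1 * t).sum().backward()
(y2 * t).sum().backward()
print(torch.allclose(alpha * W_scaled.grad, W.grad, atol=1e-3))   # True
```

Under plain SGD this means the effective learning rate scales roughly as the nominal learning rate divided by the squared weight norm, so weight decay controls the step size rather than the capacity of the model; ADAM's per-parameter rescaling removes only part of this coupling.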