Norm matters: efficient and accurate normalization schemes in deep networks

Over the past few years, Batch-Normalization has been commonly used in deep networks, allowing faster training and high performance for a wide variety of applications. However, the reasons behind its effectiveness remain poorly understood, and several shortcomings hinder its use for certain tasks. In this work, we present a novel view on the purpose and function of normalization methods and weight decay, as tools to decouple the weights' norm from the underlying optimized objective. This property highlights the connection between practices such as normalization, weight decay, and learning-rate adjustments. We suggest several alternatives to the widely used $L^2$ batch-norm, using normalization in $L^1$ and $L^\infty$ spaces that can substantially improve numerical stability in low-precision implementations as well as provide computational and memory benefits. We demonstrate that such methods enable the first batch-norm alternative that works in half-precision implementations. Finally, we suggest a modification to weight-normalization, which improves its performance on large-scale tasks.

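As a concrete illustration of the $L^1$ alternative, the sketch below shows one way the batch statistic could be computed: the per-channel standard deviation is replaced by the mean absolute deviation, scaled by $\sqrt{\pi/2}$ so that it estimates $\sigma$ when activations are roughly Gaussian. This is a minimal PyTorch sketch under that assumption, not the authors' reference implementation; the function name, tensor layout (N, C, H, W), and omission of the learnable affine parameters are illustrative choices.

```python
import math
import torch

def l1_batch_norm(x: torch.Tensor, eps: float = 1e-5) -> torch.Tensor:
    """Normalize activations using an L1-based estimate of the per-channel scale."""
    # Per-channel mean over the batch and spatial dimensions (layout: N, C, H, W).
    mu = x.mean(dim=(0, 2, 3), keepdim=True)
    # Mean absolute deviation, rescaled by sqrt(pi/2) so it matches the
    # standard deviation in expectation for Gaussian-like activations.
    scale = (x - mu).abs().mean(dim=(0, 2, 3), keepdim=True) * math.sqrt(math.pi / 2)
    return (x - mu) / (scale + eps)

if __name__ == "__main__":
    x = torch.randn(32, 16, 8, 8)
    y = l1_batch_norm(x)
    print(y.mean().item(), y.std().item())  # approximately 0 and 1
```

Because the statistic involves only absolute values and means, it avoids the squaring and square-root operations of the $L^2$ variant, which is what makes it attractive for low-precision (e.g., half-precision) arithmetic.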