论文信息 - Proxy-Normalizing Activations to Match Batch Normalization while Removing Batch Dependence

Proxy-Normalizing Activations to Match Batch Normalization while Removing Batch Dependence

We investigate the reasons for the performance degradation incurred with batchindependent normalization. We find that the prototypical techniques of layer normalization and instance normalization both induce the appearance of failure modes in the neural network’s pre-activations: (i) layer normalization induces a collapse towards channel-wise constant functions; (ii) instance normalization induces a lack of variability in instance statistics, symptomatic of an alteration of the expressivity. To alleviate failure mode (i) without aggravating failure mode (ii), we introduce the technique “Proxy Normalization” that normalizes post-activations using a proxy distribution. When combined with layer normalization or group normalization, this batch-independent normalization emulates batch normalization’s behavior and consistently matches or exceeds its performance.

Zach Eaton-Rosen | Carlo Luschi | Dominic Masters | Antoine Labatie

[1] Sanjeev Arora,et al. An Exponential Learning Rate Schedule for Deep Learning , 2020, ICLR.

[2] Kaiming He,et al. Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour , 2017, ArXiv.

[3] Hao Li,et al. On the effect of Batch Normalization and Weight Normalization in Generative Adversarial Networks , 2017, ArXiv.

[4] David Rolnick,et al. Complexity of Linear Regions in Deep Networks , 2019, ICML.

[5] Hongyi Zhang,et al. mixup: Beyond Empirical Risk Minimization , 2017, ICLR.

[6] Sanjeev Arora,et al. Theoretical Analysis of Auto Rate-Tuning by Batch Normalization , 2018, ICLR.

[7] Kevin Smith,et al. Bayesian Uncertainty Estimation for Batch Normalized Deep Networks , 2018, ICML.

[8] Boris Flach,et al. Stochastic Normalizations as Bayesian Learning , 2018, ACCV.

[9] Carla P. Gomes,et al. Understanding Batch Normalization , 2018, NeurIPS.

[10] Graham W. Taylor,et al. Batch Normalization is a Cause of Adversarial Vulnerability , 2019, ArXiv.

[11] Michael James,et al. Online Normalization for Training Neural Networks , 2019, NeurIPS.

[12] Jonathon Shlens,et al. A Learned Representation For Artistic Style , 2016, ICLR.

[13] Venu Govindaraju,et al. Normalization Propagation: A Parametric Technique for Removing Internal Covariate Shift in Deep Networks , 2016, ICML.

[14] Jonathon Shlens,et al. Accelerating Training of Deep Neural Networks with a Standardization Loss , 2019, ArXiv.

[15] Kihyuk Sohn,et al. Exploring Normalization in Deep Residual Networks with Concatenated Rectified Linear Units , 2017, AAAI.

[16] Alan L. Yuille,et al. Intriguing Properties of Adversarial Training at Scale , 2020, ICLR.

[17] Andrea Vedaldi,et al. Texture Networks: Feed-forward Synthesis of Textures and Stylized Images , 2016, ICML.

[18] Jian Sun,et al. Identity Mappings in Deep Residual Networks , 2016, ECCV.

[19] Jascha Sohl-Dickstein,et al. A Mean Field Theory of Batch Normalization , 2019, ICLR.

[20] Pascal Vincent,et al. Recurrent Normalization Propagation , 2017, ICLR.

[21] Kaiming He,et al. Group Normalization , 2018, ECCV.

[22] Aleksander Madry,et al. How Does Batch Normalization Help Optimization? (No, It Is Not About Internal Covariate Shift) , 2018, NeurIPS.

[23] Shankar Krishnan,et al. Filter Response Normalization Layer: Eliminating Batch Dependence in the Training of Deep Neural Networks , 2020, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[24] Justin Johnson,et al. Rethinking "Batch" in BatchNorm , 2021, ArXiv.

[25] Dawn Xiaodong Song,et al. Gradients explode - Deep Networks are shallow - ResNet explained , 2017, ICLR.

[26] Joscha Bach,et al. Mean Shift Rejection: Training Deep Neural Networks Without Minibatch Statistics or Normalization , 2019, ArXiv.

[27] Bhiksha Raj,et al. Is normalization indispensable for training deep neural network? , 2020, NeurIPS.

[28] Shankar Krishnan,et al. An Investigation into Neural Net Optimization via Hessian Eigenvalue Density , 2019, ICML.

[29] Ruimao Zhang,et al. Differentiable Dynamic Normalization for Learning Deep Representation , 2019, ICML.

[30] Tengyu Ma,et al. Fixup Initialization: Residual Learning Without Normalization , 2019, ICLR.

[31] Tim Salimans,et al. Weight Normalization: A Simple Reparameterization to Accelerate Training of Deep Neural Networks , 2016, NIPS.

[32] Andrea Vedaldi,et al. Improved Texture Networks: Maximizing Quality and Diversity in Feed-Forward Stylization and Texture Synthesis , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[33] Francis Bach,et al. Batch normalization provably avoids ranks collapse for randomly initialised deep networks , 2020, NeurIPS.

[34] Quoc V. Le,et al. Evolving Normalization-Activation Layers , 2020, NeurIPS.

[35] Zhuowen Tu,et al. Aggregated Residual Transformations for Deep Neural Networks , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[36] Allan Pinkus,et al. Multilayer Feedforward Networks with a Non-Polynomial Activation Function Can Approximate Any Function , 1991, Neural Networks.

[37] Leon A. Gatys,et al. Image Style Transfer Using Convolutional Neural Networks , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[38] Boris Ginsburg,et al. Comparison of Batch Normalization and Weight Normalization Algorithms for the Large-scale Image Classification , 2017, ArXiv.

[39] Quoc V. Le,et al. Adversarial Examples Improve Image Recognition , 2020, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[40] Quoc V. Le,et al. AutoAugment: Learning Augmentation Strategies From Data , 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[41] Seong Joon Oh,et al. CutMix: Regularization Strategy to Train Strong Classifiers With Localizable Features , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[42] Michael I. Jordan,et al. Transferable Normalization: Towards Improving Transferability of Deep Neural Networks , 2019, NeurIPS.

[43] Guodong Zhang,et al. Three Mechanisms of Weight Decay Regularization , 2018, ICLR.

[44] Fei-Fei Li,et al. ImageNet: A large-scale hierarchical image database , 2009, 2009 IEEE Conference on Computer Vision and Pattern Recognition.

[45] Twan van Laarhoven,et al. L2 Regularization versus Batch and Weight Normalization , 2017, ArXiv.

[46] Zach Eaton-Rosen,et al. Making EfficientNet More Efficient: Exploring Batch-Independent Normalization, Group Convolutions and Reduced Resolution Training , 2021, ArXiv.

[47] David Rolnick,et al. How to Start Training: The Effect of Initialization and Architecture , 2018, NeurIPS.

[48] Andrea Vedaldi,et al. Instance Normalization: The Missing Ingredient for Fast Stylization , 2016, ArXiv.

[49] Lei Huang,et al. Group Whitening: Balancing Learning Efficiency and Representational Capacity , 2020, ArXiv.

[50] Jascha Sohl-Dickstein,et al. Is Batch Norm unique? An empirical investigation and prescription to emulate the best properties of common normalizers without batch dependence , 2020, ArXiv.

[51] Arthur Jacot,et al. Freeze and Chaos for DNNs: an NTK view of Batch Normalization, Checkerboard and Boundary Effects , 2019, ArXiv.