Proxy-Normalizing Activations to Match Batch Normalization while Removing Batch Dependence

We investigate the reasons for the performance degradation incurred with batch-independent normalization. We find that the prototypical techniques of layer normalization and instance normalization both induce the appearance of failure modes in the neural network's pre-activations: (i) layer normalization induces a collapse towards channel-wise constant functions; (ii) instance normalization induces a lack of variability in instance statistics, symptomatic of an alteration of expressivity. To alleviate failure mode (i) without aggravating failure mode (ii), we introduce the technique "Proxy Normalization", which normalizes post-activations using a proxy distribution. When combined with layer normalization or group normalization, this batch-independent normalization emulates batch normalization's behavior and consistently matches or exceeds its performance.
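
To make the mechanism concrete, below is a minimal sketch of layer normalization combined with a proxy-normalization step. It assumes a channels-last activation tensor, a ReLU nonlinearity, and Monte Carlo estimation of the proxy statistics by sampling; the class name and the parameter names (proxy_beta, proxy_gamma, num_samples) are illustrative choices and not the paper's exact implementation.

```python
# Minimal sketch of layer normalization followed by Proxy Normalization,
# assuming a channels-last activation tensor of shape (N, H, W, C).
# The proxy statistics are estimated here by sampling; names and details
# are illustrative, not the authors' reference implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F


class LayerNormProxyNorm(nn.Module):
    def __init__(self, num_channels: int, num_samples: int = 256, eps: float = 1e-5):
        super().__init__()
        self.eps = eps
        self.num_samples = num_samples
        # Affine parameters applied after layer normalization.
        self.gamma = nn.Parameter(torch.ones(num_channels))
        self.beta = nn.Parameter(torch.zeros(num_channels))
        # Parameters of the per-channel Gaussian proxy distribution
        # N(proxy_beta, (1 + proxy_gamma)^2), initialized to a standard normal.
        self.proxy_beta = nn.Parameter(torch.zeros(num_channels))
        self.proxy_gamma = nn.Parameter(torch.zeros(num_channels))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # 1) Batch-independent normalization of the pre-activations
        #    (layer norm over the spatial and channel dimensions of each instance).
        y = F.layer_norm(x, x.shape[1:], eps=self.eps)
        # 2) Affine transform and nonlinearity, as in a standard normalized block.
        z = F.relu(self.gamma * y + self.beta)
        # 3) Proxy statistics: push samples of the per-channel Gaussian proxy
        #    through the same affine + nonlinearity and take their mean/variance.
        proxy = self.proxy_beta + (1.0 + self.proxy_gamma) * torch.randn(
            self.num_samples, self.gamma.numel(), device=x.device, dtype=x.dtype
        )
        proxy_out = F.relu(self.gamma * proxy + self.beta)
        mean = proxy_out.mean(dim=0)
        var = proxy_out.var(dim=0, unbiased=False)
        # 4) Normalize the post-activations channel-wise with the proxy statistics,
        #    counteracting the channel-wise collapse induced by layer norm alone.
        return (z - mean) / torch.sqrt(var + self.eps)
```

In this sketch the proxy statistics are recomputed from fresh Gaussian samples at each forward pass, which keeps the normalization entirely batch-independent; an analytic or quadrature-based evaluation of the proxy mean and variance would serve the same purpose without the sampling noise.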
