Do Normalization Layers in a Deep ConvNet Really Need to Be Distinct?

Yes, they do. This work investigates whether different normalization layers in a deep ConvNet require different normalizers, taking a first step toward understanding this phenomenon. We place a switchable normalization (SN) layer after each convolutional layer; SN learns to choose a normalizer for that layer from a pool of normalization methods. Through systematic experiments on ImageNet, COCO, Cityscapes, and ADE20K, we answer three questions: (a) Is it useful to allow each normalization layer to select its own normalizer? (b) What impacts the choice of normalizers? (c) Do different tasks and datasets prefer different normalizers? Our results suggest that (1) using distinct normalizers improves both the learning and the generalization of a ConvNet; (2) the choice of normalizers depends more on depth and batch size than on parameter initialization, learning-rate decay, or the choice of solver; (3) different tasks and datasets behave differently when learning to select normalizers.
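
To make the mechanism concrete, below is a minimal sketch of how a switchable normalization layer can be implemented, following the published description of SN: the layer computes instance-norm (IN), layer-norm (LN), and batch-norm (BN) statistics over the same feature map and blends them with softmax-normalized importance weights learned end-to-end with the rest of the network. The module name, the three-normalizer pool, and the omission of inference-time running statistics are illustrative simplifications, not the authors' exact implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class SwitchableNorm2d(nn.Module):
    """Sketch of switchable normalization (SN) for NCHW feature maps.

    SN blends IN, LN, and BN statistics with learned importance weights,
    so each layer can select (a soft mixture of) its own normalizer.
    Running statistics for inference are omitted for brevity.
    """

    def __init__(self, num_channels: int, eps: float = 1e-5):
        super().__init__()
        self.eps = eps
        # Affine parameters shared by all normalizers in the pool.
        self.weight = nn.Parameter(torch.ones(1, num_channels, 1, 1))
        self.bias = nn.Parameter(torch.zeros(1, num_channels, 1, 1))
        # One importance logit per normalizer (IN, LN, BN),
        # learned separately for the means and the variances.
        self.mean_logits = nn.Parameter(torch.zeros(3))
        self.var_logits = nn.Parameter(torch.zeros(3))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # IN statistics: per sample, per channel.
        mu_in = x.mean(dim=(2, 3), keepdim=True)
        var_in = x.var(dim=(2, 3), keepdim=True, unbiased=False)
        # LN statistics: per sample, across channels and positions.
        mu_ln = x.mean(dim=(1, 2, 3), keepdim=True)
        var_ln = x.var(dim=(1, 2, 3), keepdim=True, unbiased=False)
        # BN statistics: per channel, across the whole mini-batch.
        mu_bn = x.mean(dim=(0, 2, 3), keepdim=True)
        var_bn = x.var(dim=(0, 2, 3), keepdim=True, unbiased=False)

        # Softmax turns the logits into importance weights over the pool.
        w_mu = F.softmax(self.mean_logits, dim=0)
        w_var = F.softmax(self.var_logits, dim=0)
        mu = w_mu[0] * mu_in + w_mu[1] * mu_ln + w_mu[2] * mu_bn
        var = w_var[0] * var_in + w_var[1] * var_ln + w_var[2] * var_bn

        x_hat = (x - mu) / torch.sqrt(var + self.eps)
        return x_hat * self.weight + self.bias
```

In this reading of the paper's setup, every normalization layer in the ConvNet is replaced by such a module, and inspecting the softmax of `mean_logits` and `var_logits` layer by layer after training reveals which normalizer each layer has learned to prefer.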
