Regularizing activations in neural networks via distribution matching with the Wasserstein metric

Regularization and normalization have become indispensable components in training deep neural networks, yielding faster training and improved generalization. We propose the projected error function regularization loss (PER), which encourages activations to follow the standard normal distribution. PER randomly projects activations onto a one-dimensional space and computes the regularization loss in the projected space. In that space, PER behaves like the Pseudo-Huber loss, thereby combining the advantages of $L^1$ and $L^2$ regularization. Moreover, PER captures interactions between hidden units through projection vectors drawn from the unit sphere. In doing so, PER minimizes an upper bound on the Wasserstein distance of order one between the empirical distribution of activations and the standard normal distribution. To the best of the authors' knowledge, this is the first work to regularize activations via distribution matching in the space of probability distributions. We evaluate the proposed method on image classification and word-level language modeling tasks.
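To make the mechanism concrete, below is a minimal PyTorch-style sketch of such a regularizer. It assumes the projected penalty is the closed-form expectation $\mathbb{E}_{Z\sim\mathcal{N}(0,1)}|a - Z| = a\,\mathrm{erf}(a/\sqrt{2}) + \sqrt{2/\pi}\,e^{-a^2/2}$ applied to randomly projected activations; the function name `per_regularizer`, the number of projections, and the averaging are illustrative assumptions rather than the paper's exact formulation.

```python
# A minimal sketch of a PER-style activation regularizer (assumptions noted above):
# project activations onto random unit vectors and penalize each projection a with
# E_{Z~N(0,1)}|a - Z| = a*erf(a/sqrt(2)) + sqrt(2/pi)*exp(-a^2/2), which is quadratic
# near zero and linear for large |a|, similar to the Pseudo-Huber loss.
import math
import torch


def per_regularizer(activations: torch.Tensor, num_projections: int = 16) -> torch.Tensor:
    """Penalize deviation of activations from the standard normal distribution.

    activations: tensor of shape (batch_size, num_units).
    Returns a scalar loss averaged over samples and random 1-D projections.
    """
    _, num_units = activations.shape
    # Random directions on the unit sphere, one column per projection.
    directions = torch.randn(num_units, num_projections, device=activations.device)
    directions = directions / directions.norm(dim=0, keepdim=True)
    # Project activations onto each direction: shape (batch_size, num_projections).
    projected = activations @ directions
    # Closed-form E_{Z~N(0,1)}|a - Z| evaluated elementwise on the projections.
    penalty = projected * torch.erf(projected / math.sqrt(2.0)) \
        + math.sqrt(2.0 / math.pi) * torch.exp(-0.5 * projected ** 2)
    return penalty.mean()
```

In training, the returned scalar would be added to the task loss with a tunable weight, e.g. `loss = task_loss + lambda_per * per_regularizer(hidden)`, where `lambda_per` is a hypothetical regularization coefficient.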
