SHADE: Regularization for Deep Learning

Regularization is a central challenge in training deep neural networks. In this paper, we propose a new information-theoretic regularization scheme named SHADE, for SHAnnon DEcay. The originality of the approach is to define a prior based on conditional entropy, which explicitly decouples the learning of invariant representations in the regularizer from the learning of correlations between inputs and labels in the data-fitting term. Our second contribution is a stochastic version of the regularizer that is compatible with deep learning, yielding a tractable training scheme. We empirically validate the ability of our approach to improve classification performance over common regularization schemes on several standard architectures.
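
As a schematic sketch consistent with the abstract (the precise estimator, layer weighting, and trade-off coefficient are not specified here, so the symbols λ and Z_ℓ below are illustrative assumptions), the training objective can be read as a cross-entropy data-fitting term plus a conditional-entropy penalty on the intermediate representations:

\[
\min_{\theta} \; \mathbb{E}_{(x,y)\sim\mathcal{D}}\big[-\log p_{\theta}(y \mid x)\big] \;+\; \lambda \sum_{\ell} H\!\left(Z_{\ell} \mid Y\right),
\]

where the first term captures the correlations between inputs and labels, while minimizing H(Z_ℓ | Y) pushes each layer's representation Z_ℓ to discard variability that is irrelevant given the label Y, i.e., to become class-conditionally invariant. The stochastic version mentioned in the abstract would replace H(Z_ℓ | Y) with a mini-batch estimate computed during training.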
