SHADE: Information-Based Regularization for Deep Learning

Regularization is a central challenge in training deep neural networks. In this paper, we propose a new information-theoretic regularization scheme named SHADE, for SHAnnon DEcay. The originality of the approach is to define a prior based on conditional entropy, which explicitly decouples the learning of invariant representations in the regularizer from the learning of correlations between inputs and labels in the data-fitting term. Our second contribution is to derive a stochastic version of the regularizer that is compatible with deep learning, resulting in a tractable training scheme. We empirically validate the effectiveness of our approach, showing improved classification performance over standard regularization schemes on several standard architectures.
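
To make the idea concrete, below is a minimal PyTorch-style sketch of such a training objective: a standard cross-entropy data term plus an entropy-style penalty on an intermediate representation given the label. The estimator used here (a per-class Gaussian proxy, 0.5 * log of within-class feature variance), the weight beta, and the model_backbone / classifier_head split are illustrative assumptions for exposition, not the paper's exact SHADE estimator.

```python
# Illustrative sketch only: an entropy-style regularizer on intermediate
# representations, NOT the exact SHADE formulation. It penalizes a Gaussian
# proxy for the conditional entropy H(Z | Y): the log within-class variance
# of the features, computed per mini-batch.
import torch
import torch.nn.functional as F


def conditional_entropy_proxy(z, y, num_classes, eps=1e-6):
    """Average over classes of 0.5 * log(within-class feature variance).

    Under a per-class Gaussian assumption this is, up to additive constants,
    an estimate of the differential entropy H(Z | Y) per feature dimension.
    """
    penalties = []
    for c in range(num_classes):
        zc = z[y == c]
        if zc.size(0) < 2:  # need at least two samples to estimate a variance
            continue
        var_c = zc.var(dim=0, unbiased=False) + eps
        penalties.append(0.5 * torch.log(var_c).mean())
    if not penalties:
        return z.new_zeros(())
    return torch.stack(penalties).mean()


def training_step(model_backbone, classifier_head, x, y, num_classes, beta=0.1):
    # Hypothetical training step: data-fitting term plus entropy-style prior.
    z = model_backbone(x)                   # intermediate representation Z
    logits = classifier_head(z)
    data_term = F.cross_entropy(logits, y)  # fits input-label correlations
    reg_term = conditional_entropy_proxy(z, y, num_classes)  # encourages class-wise invariance
    return data_term + beta * reg_term
```

The point of the split is visible in the objective: the cross-entropy term alone drives label prediction, while the entropy-style term alone drives invariance of the representation within each class, so the two roles are decoupled as described in the abstract.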
