The Role of the Information Bottleneck in Representation Learning

A grand challenge in representation learning is the development of algorithms that learn the distinct explanatory factors of variation behind high-dimensional data. Encoder models are usually trained to optimize performance on the training data, whereas the real objective is to generalize well to unseen data. Although numerical evidence suggests that noise injection at the level of representations may improve the generalization ability of the resulting encoders, an information-theoretic justification of this principle has remained elusive. In this work, we derive an upper bound on the generalization gap associated with the cross-entropy loss and show that jointly minimizing the empirical risk and this bound, scaled by a suitable multiplier, is equivalent to optimizing the Information Bottleneck objective with respect to the empirical data distribution. We then specialize these general conclusions to analyze dropout regularization in deep neural networks, explaining how this regularizer helps to decrease the generalization gap.
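
To make the resulting objective concrete, the following is a minimal training sketch (an illustration under stated assumptions, not the paper's implementation) in which the empirical cross-entropy risk is minimized jointly with a multiplier beta times a variational upper bound on the mutual information I(X; T) between the input X and the representation T. The bound is realized by injecting Gaussian noise at the representation, in the spirit of the deep variational Information Bottleneck; the architecture, layer sizes, and the value of beta are illustrative assumptions.

    # Minimal PyTorch sketch (illustrative, not the paper's method) of an
    # Information Bottleneck-style objective: empirical cross-entropy risk
    # plus beta times a variational upper bound on I(X; T).
    # Architecture and hyperparameters below are assumptions for illustration.
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class StochasticEncoder(nn.Module):
        def __init__(self, in_dim=784, rep_dim=32, n_classes=10):
            super().__init__()
            self.backbone = nn.Sequential(nn.Linear(in_dim, 256), nn.ReLU())
            self.mu = nn.Linear(256, rep_dim)       # mean of q(t | x)
            self.log_var = nn.Linear(256, rep_dim)  # log-variance of q(t | x)
            self.classifier = nn.Linear(rep_dim, n_classes)

        def forward(self, x):
            h = self.backbone(x)
            mu, log_var = self.mu(h), self.log_var(h)
            # Noise injection at the representation level (reparameterization).
            t = mu + torch.exp(0.5 * log_var) * torch.randn_like(mu)
            return self.classifier(t), mu, log_var

    def ib_objective(logits, y, mu, log_var, beta=1e-3):
        # Empirical risk: cross-entropy on the training batch.
        ce = F.cross_entropy(logits, y)
        # KL(q(t|x) || N(0, I)), averaged over the batch: a variational
        # upper bound on I(X; T).
        kl = 0.5 * (mu.pow(2) + log_var.exp() - log_var - 1.0).sum(dim=1).mean()
        return ce + beta * kl

Here beta plays the role of the "suitable multiplier" above: beta = 0 recovers plain empirical-risk minimization, while larger beta compresses the representation, trading fit on the training data for a smaller generalization gap, much as dropout noise does in the analysis.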
