Information Bottleneck and its Applications in Deep Learning

Information Theory (IT) has been used in Machine Learning (ML) since the early days of the field. In the last decade, advances in Deep Neural Networks (DNNs) have produced striking improvements across many applications of ML, prompting the community to revisit earlier ideas within this new framework. Ideas from IT are no exception. One idea currently being revisited by many researchers is the Information Bottleneck (IB), an information-theoretic formulation of information extraction. The IB is promising both for analyzing and for improving DNNs. The goal of this survey is to review the IB concept and demonstrate its applications in deep learning. The information-theoretic nature of IB also makes it a good vehicle for illustrating, more generally, how IT can be used in ML. Two themes are highlighted throughout: i) the concise and unifying view that IT provides on seemingly unrelated ML methods, demonstrated by explaining how IB relates to minimal sufficient statistics, stochastic gradient descent, and variational auto-encoders; and ii) the common technical mistakes and pitfalls that arise when applying ideas from IT, discussed through a careful study of several recent methods that suffer from them.
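For concreteness, the IB principle in its standard statement (due to Tishby, Pereira, and Bialek) seeks a representation T of an input X that is as compressed as possible while retaining the information relevant to a target Y. The sketch below uses the conventional notation for that formulation (X input, Y target, T representation, and a non-negative trade-off parameter \beta); these symbols are the usual convention rather than anything introduced in the abstract above.

% Information Bottleneck objective, minimized over the encoder p(t|x)
% subject to the Markov chain Y -> X -> T: I(X;T) measures compression,
% I(T;Y) measures relevance, and \beta >= 0 sets the trade-off between them.
\begin{equation}
  \mathcal{L}_{\mathrm{IB}}\big[\, p(t \mid x) \,\big] \;=\; I(X;T) \;-\; \beta\, I(T;Y)
\end{equation}

Larger values of \beta favor representations that preserve more information about Y at the cost of less compression of X; the limit \beta \to \infty recovers (minimal) sufficient statistics when they exist.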
