The Information Bottleneck Problem and its Applications in Machine Learning

Inference capabilities of machine learning (ML) systems have skyrocketed in recent years, and they now play a pivotal role in various aspects of society. The goal in statistical learning is to use data to obtain simple algorithms for predicting a random variable $Y$ from a correlated observation $X$. Since the dimension of $X$ is typically huge, computationally feasible solutions should summarize it into a lower-dimensional feature vector $T$, from which $Y$ is predicted. The algorithm succeeds if $T$ is a good proxy for $Y$, despite this dimensionality reduction. A myriad of ML algorithms (mostly employing deep learning (DL)) for finding such representations $T$ from real-world data are now available. While these methods are effective in practice, a comprehensive theory explaining their success is still lacking. The information bottleneck (IB) theory recently emerged as a bold information-theoretic paradigm for analyzing DL systems. Adopting mutual information as the figure of merit, it suggests that the best representation $T$ should be maximally informative about $Y$ while minimizing its mutual information with $X$. In this tutorial we survey the information-theoretic origins of this abstract principle and its recent impact on DL. For the latter, we cover implications of the IB problem for DL theory, as well as practical algorithms inspired by it. Our goal is to provide a unified and cohesive description. A clear view of current knowledge is important for further leveraging the IB and other information-theoretic ideas to study DL models.
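For concreteness, the verbal statement of the IB principle above corresponds to the following optimization. This is a minimal sketch following the classical IB formulation; the trade-off parameter $\beta$ and the Markov chain $Y \leftrightarrow X \leftrightarrow T$ are standard in the IB literature and are not introduced in the abstract itself:
\[
  \min_{P_{T\mid X}:\; Y \leftrightarrow X \leftrightarrow T} \; I(X;T) \; - \; \beta\, I(T;Y), \qquad \beta > 0,
\]
where the minimization is over (possibly stochastic) encoders $P_{T\mid X}$: the term $I(X;T)$ penalizes retaining information about the observation (compression), while $\beta\, I(T;Y)$ rewards informativeness about the target (prediction), with $\beta$ controlling the trade-off between the two.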
