On the Information Bottleneck Problems: Models, Connections, Applications and Information Theoretic Views

This tutorial paper focuses on variants of the information bottleneck problem from an information-theoretic perspective, discussing practical methods for solving them as well as their connections to coding and learning. The intimate connections of this setting to remote source coding under the logarithmic loss distortion measure, information combining, common reconstruction, the Wyner–Ahlswede–Körner problem, the efficiency of investment information, as well as generalization, variational inference, representation learning, autoencoders, and other topics are highlighted. We discuss its extension to the distributed information bottleneck problem, with emphasis on the Gaussian model, and highlight the basic connections to uplink Cloud Radio Access Networks (CRAN) with oblivious processing. For this model, the optimal trade-offs between relevance (i.e., information) and complexity (i.e., rates) are determined in both the discrete and vector Gaussian frameworks. In the concluding outlook, some interesting open problems are mentioned, such as characterizing the optimal input ("feature") distributions under power limitations that maximize the "relevance" for the Gaussian information bottleneck under "complexity" constraints.
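For concreteness, the relevance-complexity trade-off referred to above can be stated in the familiar Lagrangian form of Tishby, Pereira and Bialek; this is the standard single-encoder formulation, recalled here for reference rather than quoted from the paper itself. Given a source X, a relevant variable Y with joint law p(x, y), and a representation U obeying the Markov chain Y - X - U, one solves

\[
\min_{p(u \mid x)} \; I(U;X) \;-\; \beta \, I(U;Y), \qquad \beta \ge 0,
\]

where I(U;Y) measures the relevance of the representation and I(U;X) its complexity (rate); sweeping the multiplier \(\beta\) traces out the relevance-complexity region that the paper characterizes for discrete, Gaussian, and distributed settings.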
