Nonlinear Information Bottleneck

Information bottleneck (IB) is a technique for extracting the information in one random variable X that is relevant for predicting another random variable Y. IB works by encoding X in a compressed "bottleneck" random variable M from which Y can be accurately decoded. However, finding the optimal bottleneck variable involves a difficult optimization problem, which until recently had been considered for only two limited cases: discrete X and Y with small state spaces, and continuous X and Y with a Gaussian joint distribution (in which case the optimal encoding and decoding maps are linear). We propose a method for performing IB on arbitrarily distributed discrete and/or continuous X and Y, while allowing for nonlinear encoding and decoding maps. Our approach relies on a novel non-parametric upper bound for mutual information. We describe how to implement our method using neural networks. We then show that it achieves better performance than the recently proposed "variational IB" method on several real-world datasets.
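To make the setup concrete, below is a minimal sketch of how such an objective could be trained with a neural network, assuming a Gaussian encoder M = f(X) + noise with fixed variance, a pairwise-distance (kernel-based) upper bound on I(X; M) in the spirit of the non-parametric bound the abstract refers to, and a cross-entropy term standing in for the I(M; Y) prediction term. The class name, layer sizes, and the exact form of the bound are illustrative assumptions, not the authors' reference implementation.

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class NonlinearIB(nn.Module):
    """Illustrative nonlinear-IB-style model: stochastic bottleneck + IB loss."""

    def __init__(self, x_dim, m_dim, n_classes, noise_std=1.0, beta=0.1):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(x_dim, 128), nn.ReLU(),
                                     nn.Linear(128, m_dim))
        self.decoder = nn.Linear(m_dim, n_classes)
        self.noise_std = noise_std
        self.beta = beta

    def forward(self, x, y):
        mu = self.encoder(x)                              # deterministic encoding of X
        m = mu + self.noise_std * torch.randn_like(mu)    # stochastic bottleneck M

        # Kernel (pairwise-distance) upper bound on I(X; M): treat the batch of
        # encodings as a mixture of Gaussians with variance noise_std**2 and
        # compute  -(1/n) sum_i log[(1/n) sum_j exp(-||mu_i - mu_j||^2 / (2 sigma^2))].
        n = mu.shape[0]
        sq_dists = torch.cdist(mu, mu) ** 2
        log_kernel = -sq_dists / (2.0 * self.noise_std ** 2)
        i_xm_bound = -(torch.logsumexp(log_kernel, dim=1) - math.log(n)).mean()

        # Cross-entropy on the decoded labels lower-bounds I(M; Y) up to a constant.
        ce = F.cross_entropy(self.decoder(m), y)

        # IB Lagrangian: trade off compressing X against predicting Y.
        return ce + self.beta * i_xm_bound
```

Minimizing this loss over mini-batches with a standard optimizer trades compression against prediction; the "variational IB" baseline mentioned above would instead bound I(X; M) with a variational approximation to the marginal of M rather than a kernel-based estimate.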
