How Does Information Bottleneck Help Deep Learning?

Numerous deep learning algorithms have been inspired by and understood via the notion of the information bottleneck, where unnecessary information is (often implicitly) minimized while task-relevant information is maximized. However, a rigorous justification of why it is desirable to control information bottlenecks has remained elusive. In this paper, we provide the first rigorous learning theory that justifies the benefit of the information bottleneck in deep learning by mathematically relating it to generalization errors. Our theory proves that controlling the information bottleneck is one way to control generalization errors in deep learning, although it is not the only or necessary way. We investigate the merit of our new mathematical findings with experiments across a range of architectures and learning settings. In many cases, generalization errors are shown to correlate with the degree of information bottleneck, i.e., the amount of unnecessary information retained at hidden layers. This paper provides a theoretical foundation for current and future methods through the lens of the information bottleneck. Our new generalization bounds scale with the degree of information bottleneck, unlike previous bounds that scale with the number of parameters, VC dimension, Rademacher complexity, stability, or robustness. Our code is publicly available at: https://github.com/xu-ji/information-bottleneck
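To make the trade-off described above concrete: the classical information bottleneck objective seeks a representation Z of the input X that minimizes I(X; Z) - beta * I(Y; Z), i.e., it compresses away information about X while retaining information about the target Y. Below is a minimal, hypothetical sketch of one common way this is operationalized in practice, a variational information bottleneck classifier in PyTorch. It is not the authors' released implementation; the names VIBClassifier, vib_loss, and beta are illustrative only. The KL term upper-bounds the compression quantity I(X; Z) (the unnecessary information), while the cross-entropy term encourages the bottleneck representation Z to keep task-relevant information about Y.

import torch
import torch.nn as nn
import torch.nn.functional as F

# Minimal sketch of a variational information bottleneck (VIB) classifier.
# A Gaussian encoder q(z|x) is regularized toward a standard-normal prior;
# the KL term upper-bounds I(X; Z), the "unnecessary" information about X.
class VIBClassifier(nn.Module):  # hypothetical name, for illustration only
    def __init__(self, in_dim, z_dim, n_classes):
        super().__init__()
        self.encoder = nn.Linear(in_dim, 2 * z_dim)  # outputs mean and log-variance of q(z|x)
        self.decoder = nn.Linear(z_dim, n_classes)   # task head applied to the bottleneck Z

    def forward(self, x):
        mu, logvar = self.encoder(x).chunk(2, dim=-1)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)  # reparameterization trick
        logits = self.decoder(z)
        # Closed-form KL( q(z|x) || N(0, I) ), averaged over the batch
        kl = 0.5 * (mu.pow(2) + logvar.exp() - logvar - 1).sum(dim=-1).mean()
        return logits, kl

def vib_loss(logits, kl, targets, beta=1e-3):
    # Cross-entropy preserves task-relevant information I(Z; Y);
    # the beta-weighted KL term penalizes unnecessary information about X.
    return F.cross_entropy(logits, targets) + beta * kl

Sweeping beta controls how strongly unnecessary information is compressed at the hidden layer, which is the quantity the abstract relates to generalization error and with which the new bounds scale.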
