Information-Theoretic Methods in Deep Neural Networks: Recent Advances and Emerging Opportunities

We present a review of recent advances and emerging opportunities in analyzing deep neural networks (DNNs) with information-theoretic methods. We first discuss popular information-theoretic quantities and their estimators. We then introduce recent developments in information-theoretic learning principles (e.g., loss functions, regularizers, and objectives) and their parameterization with DNNs. Finally, we briefly review current uses of information-theoretic concepts in several modern machine learning problems and list a few emerging opportunities.
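To make the estimator discussion concrete, below is a minimal sketch (not taken from this review) of one estimator family that appears in this literature: the matrix-based Rényi α-order entropy functional, with mutual information obtained from normalized Gram matrices via a Hadamard product. The Gaussian kernel, the kernel width `sigma`, the value `alpha = 1.01`, and the toy data are illustrative assumptions, not recommendations from the review.

```python
import numpy as np
from scipy.spatial.distance import cdist

def normalized_gram(x, sigma=1.0):
    # Gaussian Gram matrix, normalized so that trace(A) = 1.
    d2 = cdist(x, x, "sqeuclidean")
    k = np.exp(-d2 / (2.0 * sigma ** 2))
    return k / np.trace(k)

def matrix_renyi_entropy(a, alpha=1.01):
    # S_alpha(A) = (1 / (1 - alpha)) * log2( sum_i lambda_i(A)^alpha )
    lam = np.clip(np.linalg.eigvalsh(a), 0.0, None)  # guard tiny negative eigenvalues
    return np.log2(np.sum(lam ** alpha)) / (1.0 - alpha)

def matrix_renyi_mi(x, y, sigma=1.0, alpha=1.01):
    # I_alpha(X; Y) = S_alpha(A) + S_alpha(B) - S_alpha(A o B / tr(A o B)),
    # where "o" denotes the Hadamard (element-wise) product.
    a, b = normalized_gram(x, sigma), normalized_gram(y, sigma)
    ab = a * b
    ab = ab / np.trace(ab)
    return (matrix_renyi_entropy(a, alpha)
            + matrix_renyi_entropy(b, alpha)
            - matrix_renyi_entropy(ab, alpha))

# Toy check: a strongly dependent pair should score higher than an independent one.
rng = np.random.default_rng(0)
x = rng.normal(size=(200, 1))
y_dependent = x + 0.1 * rng.normal(size=(200, 1))
y_independent = rng.normal(size=(200, 1))
print(matrix_renyi_mi(x, y_dependent))    # relatively large
print(matrix_renyi_mi(x, y_independent))  # close to zero
```

Other estimator families surveyed in this area (e.g., k-nearest-neighbor or variational bounds) expose the same basic interface: take paired samples of X and Y and return a scalar estimate.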
