Estimating Information Flow in Neural Networks

We study the flow of information and the evolution of internal representations during deep neural network (DNN) training, aiming to demystify the compression aspect of the information bottleneck theory. The theory suggests that DNN training comprises a rapid fitting phase followed by a slower compression phase, in which the mutual information $I(X;T)$ between the input $X$ and internal representations $T$ decreases. Several papers observe compression of estimated mutual information on different DNN models, but the true $I(X;T)$ over these networks is provably either constant (discrete $X$) or infinite (continuous $X$). This work explains the discrepancy between theory and experiments, and clarifies what was actually measured by these past works. To this end, we introduce an auxiliary (noisy) DNN framework for which $I(X;T)$ is a meaningful quantity that depends on the network's parameters. This noisy framework is shown to be a good proxy for the original (deterministic) DNN in terms of both performance and learned representations. We then develop a rigorous estimator for $I(X;T)$ in noisy DNNs and observe compression in various models. By relating $I(X;T)$ in the noisy DNN to an information-theoretic communication problem, we show that compression is driven by the progressive clustering of hidden representations of inputs from the same class. Several methods for directly monitoring clustering of hidden representations, in both noisy and deterministic DNNs, are used to show that meaningful clusters form in the $T$ space. Finally, we return to the estimator of $I(X;T)$ employed in past works and demonstrate that, while it fails to capture the true (vacuous) mutual information, it does serve as a measure of clustering. This clarifies the past observations of compression and isolates the geometric clustering of hidden representations as the true phenomenon of interest.
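
To make the setup concrete, here is a minimal sketch, assuming PyTorch, of what a noisy network and a Monte Carlo version of the $I(X;T)$ estimate could look like. It is not the authors' implementation: the `NoisyMLP` architecture, the noise level `beta`, and the routine `mutual_information_mc` are illustrative names and choices, reflecting only the description above that $T = f(X) + Z$ with Gaussian $Z$ makes $I(X;T)$ finite, together with the identity $I(X;T) = h(T) - h(Z)$ when $f$ is deterministic and $Z$ is independent of $X$.

```python
# A minimal sketch, NOT the paper's implementation. Assumptions: additive
# isotropic Gaussian noise of std `beta` after each hidden activation, and a
# Monte Carlo estimate of I(X;T) that treats T as a Gaussian mixture centred
# at the clean hidden representations of a finite, equiprobable input sample.
import math
import torch
import torch.nn as nn


class NoisyMLP(nn.Module):
    """Fully connected net whose hidden layers output T = tanh(Wx + b) + Z,
    Z ~ N(0, beta^2 I); the noise makes I(X;T) finite and parameter-dependent."""

    def __init__(self, dims=(784, 128, 64, 10), beta=0.1):
        super().__init__()
        self.beta = beta
        self.layers = nn.ModuleList(
            nn.Linear(d_in, d_out) for d_in, d_out in zip(dims[:-1], dims[1:])
        )

    def forward(self, x, add_noise=True, return_hidden=False):
        hidden = []
        for i, layer in enumerate(self.layers):
            x = layer(x)
            if i < len(self.layers) - 1:  # hidden layers only
                x = torch.tanh(x)
                if add_noise:
                    x = x + self.beta * torch.randn_like(x)
                hidden.append(x)
        return (x, hidden) if return_hidden else x


def mutual_information_mc(clean_reps, beta, n_mc=1000):
    """Monte Carlo estimate of I(X;T) for T = f(X) + Z, Z ~ N(0, beta^2 I),
    with X uniform over the rows of `clean_reps` (shape [n, d]).
    Uses I(X;T) = h(T) - h(Z): h(Z) is known in closed form, and h(T) is
    estimated by averaging -log of the mixture density at samples of T."""
    n, d = clean_reps.shape
    neg_log_p = []
    for _ in range(n_mc):
        idx = torch.randint(n, (1,))
        t = clean_reps[idx] + beta * torch.randn(1, d)  # draw T from the mixture
        # log of (1/n) * sum_i N(t; f(x_i), beta^2 I)
        sq = ((t - clean_reps) ** 2).sum(dim=1) / (2 * beta ** 2)
        log_p = (torch.logsumexp(-sq, dim=0) - math.log(n)
                 - 0.5 * d * math.log(2 * math.pi * beta ** 2))
        neg_log_p.append(-log_p)
    h_t = torch.stack(neg_log_p).mean()                          # estimate of h(T)
    h_z = 0.5 * d * math.log(2 * math.pi * math.e * beta ** 2)   # h(Z), exact
    return (h_t - h_z).item()
```

A usage sketch under the same assumptions: train `NoisyMLP` as usual, periodically pass the training inputs through the network with `add_noise=False` and `return_hidden=True` to collect the clean representations $f(x_i)$ of a given layer, and feed them (detached) to `mutual_information_mc` with the same `beta`. Tracking the returned value over training epochs is the kind of measurement that, per the abstract, exhibits compression driven by clustering of same-class representations.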
