Entropy and mutual information in models of deep neural networks

We examine a class of deep learning models for which information-theoretic quantities can be computed tractably. Our contributions are threefold: (i) we show how entropies and mutual informations can be derived from heuristic statistical-physics methods, under the assumption that the weight matrices are independent and orthogonally invariant; (ii) we extend the particular cases in which this result is known to be rigorously exact, by proving it for two-layer networks with Gaussian random weights using the recently introduced adaptive interpolation method; (iii) we propose an experimental framework built on generative models of synthetic datasets, on which we train deep neural networks with a weight constraint designed so that the assumption in (i) continues to hold during learning. We study the behavior of entropies and mutual informations throughout learning and conclude that, in the proposed setting, the relationship between compression and generalization remains elusive.
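
To make the weight assumption in (i) concrete, below is a minimal sketch (not the authors' code) of how one can sample an orthogonally invariant weight matrix and push an input through a toy stochastic two-layer model. All names (`haar_orthogonal`, `orthogonally_invariant_weight`), sizes, and the noise level are illustrative assumptions; note that i.i.d. Gaussian weights, the rigorously proven case in (ii), are themselves a special case of orthogonal invariance.

```python
import numpy as np

rng = np.random.default_rng(0)

def haar_orthogonal(n, rng):
    """Sample an n x n orthogonal matrix from the Haar measure via QR of a Gaussian matrix."""
    q, r = np.linalg.qr(rng.standard_normal((n, n)))
    return q * np.sign(np.diag(r))  # sign fix makes the distribution exactly Haar

def orthogonally_invariant_weight(singular_values, rng):
    """W = U diag(s) V^T with independent Haar-distributed U, V.

    The law of W is then unchanged by left/right orthogonal rotations,
    which is the invariance assumed in contribution (i).
    """
    n = len(singular_values)
    u, v = haar_orthogonal(n, rng), haar_orthogonal(n, rng)
    return u @ np.diag(singular_values) @ v.T

# Toy stochastic two-layer model x -> t1 -> t2; the small additive noise keeps
# the differential entropies of the layer activities finite.
n, noise = 128, 0.1
w1 = orthogonally_invariant_weight(np.ones(n), rng)  # orthogonally invariant layer
w2 = rng.standard_normal((n, n)) / np.sqrt(n)        # i.i.d. Gaussian layer, as in (ii)
x = rng.standard_normal(n)
t1 = np.tanh(w1 @ x) + noise * rng.standard_normal(n)
t2 = np.tanh(w2 @ t1) + noise * rng.standard_normal(n)
```

One simple way to realize the training constraint in (iii), under the same assumptions and not necessarily the paper's exact construction, is to keep the sampled `u` and `v` fixed and learn only the singular values `singular_values`: the distribution of the resulting weight matrix then remains orthogonally invariant at every step of training.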
