Information Plane Analysis of Deep Neural Networks via Matrix-Based Rényi's Entropy and Tensor Kernels

Analyzing deep neural networks (DNNs) via information plane (IP) theory has recently gained considerable attention as a tool for gaining insight into, among other properties, their generalization ability. However, it is by no means obvious how to estimate the mutual information (MI) between each hidden layer and the input/desired output that is needed to construct the IP. For instance, hidden layers with many neurons require MI estimators that are robust to the high dimensionality associated with such layers. MI estimators should also handle convolutional layers naturally, while remaining computationally tractable enough to scale to large networks. None of the existing IP methods to date has been able to study truly deep Convolutional Neural Networks (CNNs) such as VGG-16. In this paper, we propose an IP analysis using the matrix-based Rényi's entropy coupled with tensor kernels over convolutional layers, leveraging the power of kernel methods to represent properties of the probability distribution independently of the dimensionality of the data. The obtained results shed new light on previous findings for small-scale DNNs, albeit obtained through a completely new approach. Importantly, the new framework enables us to provide the first comprehensive IP analysis of contemporary large-scale DNNs and CNNs, investigating the different training phases and providing new insights into the training dynamics of large-scale neural networks.
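To make the estimator concrete, the following is a minimal NumPy sketch of the matrix-based Rényi α-order entropy and the resulting MI estimate between two batches of representations. The RBF kernel, the bandwidth sigma, and α = 1.01 are illustrative assumptions rather than the paper's exact configuration, and activations are simply flattened here instead of being handled with the tensor kernels proposed for convolutional layers.

import numpy as np
from scipy.spatial.distance import cdist

def gram_matrix(X, sigma=1.0):
    """RBF Gram matrix over the N rows of X, normalized to unit trace."""
    D = cdist(X, X, metric="sqeuclidean")
    K = np.exp(-D / (2 * sigma ** 2))
    return K / np.trace(K)

def renyi_entropy(A, alpha=1.01):
    """Matrix-based Renyi entropy: S_alpha(A) = 1/(1-alpha) * log2(sum_i lambda_i(A)^alpha)."""
    eigvals = np.linalg.eigvalsh(A)
    eigvals = np.clip(eigvals, 0.0, None)  # guard against tiny negative eigenvalues
    return (1.0 / (1.0 - alpha)) * np.log2(np.sum(eigvals ** alpha) + 1e-12)

def joint_entropy(A, B, alpha=1.01):
    """Joint entropy from the trace-normalized Hadamard product of the Gram matrices."""
    AB = A * B
    return renyi_entropy(AB / np.trace(AB), alpha)

def mutual_information(X, Y, sigma=1.0, alpha=1.01):
    """I(X; Y) = S(A) + S(B) - S(A, B) with matrix-based entropies."""
    A, B = gram_matrix(X, sigma), gram_matrix(Y, sigma)
    return renyi_entropy(A, alpha) + renyi_entropy(B, alpha) - joint_entropy(A, B, alpha)

# Example: MI between a mini-batch of inputs and a hidden-layer representation.
rng = np.random.default_rng(0)
X = rng.normal(size=(128, 784))   # flattened inputs
T = rng.normal(size=(128, 64))    # hidden-layer activations
print(mutual_information(X, T))

Note that the Gram matrices are built per mini-batch, so the eigendecomposition scales with the batch size rather than the layer width, which is what allows the estimator to remain insensitive to the dimensionality of the representations.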
