Understanding Convolutional Neural Network Training with Information Theory

Using information-theoretic concepts to understand and explore the inner organization of deep neural networks (DNNs) remains a significant challenge. Recently, the concept of the information plane began to shed light on the analysis of multilayer perceptrons (MLPs). We previously provided an in-depth analysis of stacked autoencoders (SAEs) using a novel matrix-based Rényi's α-entropy functional, enabling for the first time the study of the dynamics of learning through information flow in real-world scenarios involving complex network architectures and large data. Despite the great potential of these earlier works, several open questions remain when applying information-theoretic concepts to understand convolutional neural networks (CNNs), including the accurate estimation of information quantities among multiple variables and the effects of the many different training methodologies. By extending the matrix-based Rényi's α-entropy functional to the multivariate scenario, this paper presents a systematic method to analyze CNN training with information theory. Our results validate two fundamental data processing inequalities in CNNs and also have direct implications for previous work on the training and design of CNNs.
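As a rough illustration of the matrix-based Rényi's α-entropy functional and its multivariate (Hadamard-product) extension referred to in the abstract, the sketch below estimates entropies and a multivariate mutual information directly from minibatch Gram matrices. This is a minimal Python sketch under stated assumptions, not the authors' implementation: the Gaussian kernel, the bandwidth sigma, the choice α = 1.01, and the function names (gram_matrix, renyi_entropy, joint_entropy, multivariate_mi) are all illustrative.

```python
import numpy as np


def gram_matrix(x, sigma=1.0):
    """Unit-trace Gram matrix A for a sample batch x (shape n x d), Gaussian kernel.

    For a Gaussian kernel K_ii = 1, so dividing by the trace matches the usual
    normalization A_ij = (1/n) K_ij / sqrt(K_ii K_jj)."""
    sq_dists = np.sum((x[:, None, :] - x[None, :, :]) ** 2, axis=-1)
    K = np.exp(-sq_dists / (2.0 * sigma ** 2))
    return K / np.trace(K)


def renyi_entropy(A, alpha=1.01):
    """Matrix-based Renyi entropy S_alpha(A) = log2(sum_i lambda_i(A)^alpha) / (1 - alpha)."""
    lam = np.clip(np.linalg.eigvalsh(A), 0.0, None)   # guard against tiny negative eigenvalues
    return np.log2(np.sum(lam ** alpha)) / (1.0 - alpha)


def joint_entropy(mats, alpha=1.01):
    """Joint entropy of several variables: Hadamard product of their Gram matrices,
    renormalized to unit trace, fed back into the same entropy functional."""
    H = mats[0]
    for A in mats[1:]:
        H = H * A                                      # elementwise (Hadamard) product
    return renyi_entropy(H / np.trace(H), alpha)


def multivariate_mi(layer_mats, target_mat, alpha=1.01):
    """I_alpha({A_1, ..., A_L}; B) = S(A_1, ..., A_L) + S(B) - S(A_1, ..., A_L, B)."""
    return (joint_entropy(layer_mats, alpha)
            + renyi_entropy(target_mat, alpha)
            - joint_entropy(list(layer_mats) + [target_mat], alpha))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x = rng.normal(size=(128, 32))                     # stand-in for a flattened input batch
    t1 = np.tanh(x @ rng.normal(size=(32, 16)))        # stand-ins for two feature maps
    t2 = np.tanh(x @ rng.normal(size=(32, 16)))
    print(multivariate_mi([gram_matrix(t1), gram_matrix(t2)], gram_matrix(x)))
```

In the CNN setting discussed in the abstract, each Gram matrix passed to multivariate_mi would presumably be built from one flattened feature map of a layer, so that a layer's information content is treated as a joint quantity over all of its channels; choosing α close to 1 makes the estimate approximate Shannon entropy.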
