Understanding Autoencoders with Information Theoretic Concepts

Despite their great success in practical applications, there is still a lack of theoretical and systematic methods to analyze deep neural networks. In this paper, we illustrate an advanced information theoretic methodology to understand the dynamics of learning and the design of autoencoders, a special type of deep learning architecture that resembles a communication channel. By generalizing the information plane to any cost function, and by inspecting the roles and dynamics of different layers using layer-wise information quantities, we emphasize the role that mutual information plays in quantifying learning from data. For mean square error training, we further suggest, and experimentally validate, three fundamental properties regarding the layer-wise flow of information and the intrinsic dimensionality of the bottleneck layer, using, respectively, the data processing inequality and the identification of a bifurcation point in the information plane that is controlled by the given data. Our observations have direct impact on the optimal design of autoencoders, the design of alternative feedforward training methods, and even on the problem of generalization.
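
To make the layer-wise information quantities concrete, the following is a minimal sketch of how mutual information between the input and a hidden representation could be estimated from samples with a matrix-based Rényi α-order entropy functional, and then tracked per layer to populate an information plane. The helper names (gram_matrix, renyi_entropy, mutual_information), the Gaussian kernel, the bandwidths, and the choice α = 1.01 are illustrative assumptions for this sketch, not the paper's prescribed estimator or settings.

```python
import numpy as np

def gram_matrix(x, sigma):
    """Trace-one Gram matrix from a Gaussian kernel on the rows of x (n_samples, n_features)."""
    sq_dists = np.sum((x[:, None, :] - x[None, :, :]) ** 2, axis=-1)
    k = np.exp(-sq_dists / (2.0 * sigma ** 2))
    return k / np.trace(k)

def renyi_entropy(a, alpha=1.01):
    """Matrix-based Renyi alpha-order entropy of a trace-one Gram matrix (in bits)."""
    eigvals = np.linalg.eigvalsh(a)
    eigvals = eigvals[eigvals > 1e-12]           # discard numerical zeros
    return (1.0 / (1.0 - alpha)) * np.log2(np.sum(eigvals ** alpha))

def mutual_information(x, t, sigma_x=1.0, sigma_t=1.0, alpha=1.01):
    """I(X; T) = S(A) + S(B) - S(A o B), with o the trace-normalized Hadamard product."""
    a = gram_matrix(x, sigma_x)
    b = gram_matrix(t, sigma_t)
    ab = a * b
    ab = ab / np.trace(ab)
    return renyi_entropy(a, alpha) + renyi_entropy(b, alpha) - renyi_entropy(ab, alpha)

if __name__ == "__main__":
    # Toy usage: stand-ins for input samples and a bottleneck activation.
    rng = np.random.default_rng(0)
    x = rng.normal(size=(200, 10))
    t = np.tanh(x @ rng.normal(size=(10, 3)))
    print("I(X; T) ~", mutual_information(x, t))
```

Repeating this estimate for every layer's activations, at successive training epochs, yields one point per layer per epoch; plotting, for example, I(X; T_l) against the corresponding reconstruction-side quantity traces out the information plane trajectories discussed above.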
