Maximally Informative Hierarchical Representations of High-Dimensional Data

We consider a set of probabilistic functions of some input variables as a representation of the inputs. We present bounds on how informative a representation is about input data. We extend these bounds to hierarchical representations so that we can quantify the contribution of each layer towards capturing the information in the original data. The special form of these bounds leads to a simple, bottom-up optimization procedure to construct hierarchical representations that are also maximally informative about the data. This optimization has linear computational complexity and constant sample complexity in the number of variables. These results establish a new approach to unsupervised learning of deep representations that is both principled and practical. We demonstrate the usefulness of the approach on both synthetic and real-world data.
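To make the abstract's quantities concrete, the sketch below states them in LaTeX. It assumes the standard definition of total correlation (multi-information); the layer-wise bound is given schematically, following the abstract's description of per-layer contributions, with the precise tightness conditions deferred to the paper itself.

\documentclass{article}
\usepackage{amsmath}
\begin{document}

% Total correlation (multi-information) of X = (X_1, ..., X_n):
% the standard Watanabe (1960) definition.
\[
TC(X) \;\equiv\; \sum_{i=1}^{n} H(X_i) \;-\; H(X)
\]

% Informativeness of a representation Y about X: the reduction in
% total correlation obtained by conditioning on Y. (This decomposition
% is an assumption inferred from the abstract's framing.)
\[
TC(X ; Y) \;\equiv\; TC(X) - TC(X \mid Y),
\qquad
TC(X \mid Y) \;\equiv\; \sum_{i=1}^{n} H(X_i \mid Y) - H(X \mid Y)
\]

% Schematic form of the hierarchical bound: with layers Y^1, ..., Y^r
% and the convention Y^0 = X, the layer-wise contributions
% lower-bound the total correlation of the data.
\[
TC(X) \;\ge\; \sum_{k=1}^{r} TC\!\left(Y^{k-1} ; Y^{k}\right)
\]

\end{document}

Under a decomposition of this form, each layer's contribution to the bound can be measured and optimized separately, which is what makes the bottom-up, layer-by-layer procedure described in the abstract possible.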
