Information Bottleneck for Gaussian Variables

The problem of extracting the relevant aspects of data was previously addressed through the information bottleneck (IB) method, by (soft) clustering one variable while preserving information about another, "relevance", variable. The current work addresses the extension of these ideas to obtain continuous representations that preserve relevant information, rather than discrete clusters. We give a formal definition of the general continuous IB problem and obtain an analytic solution for the optimal representation in the important case of multivariate Gaussian variables. The optimal representation turns out to be a noisy linear projection onto eigenvectors of the normalized conditional covariance matrix Σ_{x|y}Σ_x^{-1}, which is also the basis obtained in Canonical Correlation Analysis. However, in Gaussian IB the compression tradeoff parameter uniquely determines the dimension, as well as the scale of each eigenvector. This yields a novel interpretation in which solutions of different ranks lie on a continuum parametrized by the compression level. Our analysis also provides an analytic expression for the optimal tradeoff, the information curve, in terms of the eigenvalue spectrum.
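The structure of the solution can be illustrated with a short numerical sketch. The Python snippet below computes the eigenvectors of Σ_{x|y}Σ_x^{-1} and keeps only those that are active at a given tradeoff parameter β. The function name `gaussian_ib_projection`, the critical threshold 1/(1−λ_i), and the per-eigenvector scaling used here are illustrative assumptions; the abstract itself only states that the solution is a noisy linear projection onto these eigenvectors and that β determines dimension and scale.

```python
import numpy as np

def gaussian_ib_projection(cov_x, cov_y, cov_xy, beta):
    """Sketch of a Gaussian IB projection matrix A for T = A X + noise.

    cov_x, cov_y : covariance matrices of X and Y
    cov_xy       : cross-covariance Cov(X, Y)
    beta         : compression tradeoff parameter
    """
    # Conditional covariance: Sigma_{x|y} = Sigma_x - Sigma_xy Sigma_y^{-1} Sigma_yx
    cov_x_given_y = cov_x - cov_xy @ np.linalg.solve(cov_y, cov_xy.T)

    # Rows of A are (left) eigenvectors of Sigma_{x|y} Sigma_x^{-1},
    # i.e. right eigenvectors of its transpose.
    M = cov_x_given_y @ np.linalg.inv(cov_x)
    eigvals, vecs = np.linalg.eig(M.T)
    eigvals, vecs = eigvals.real, vecs.real

    rows = []
    for i in np.argsort(eigvals):                 # smallest eigenvalue first
        lam = eigvals[i]
        if lam <= 1e-12 or lam >= 1.0:
            continue
        # Assumed activation rule: an eigenvector enters the solution only
        # once beta exceeds a critical value 1 / (1 - lambda_i); this is one
        # way beta can control the rank of the representation.
        if beta <= 1.0 / (1.0 - lam):
            continue
        v = vecs[:, i]
        # Assumed scaling: grows continuously from zero at the critical beta.
        r = float(v @ cov_x @ v)
        alpha = np.sqrt((beta * (1.0 - lam) - 1.0) / (lam * r))
        rows.append(alpha * v)

    dim = cov_x.shape[0]
    return np.vstack(rows) if rows else np.zeros((0, dim))
```

Sweeping β from small to large values traces the continuum of solutions described above: additional eigenvectors become active one by one and their scales grow, so the rank of the projection increases with the allowed compression level.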
