The mutual information: Detecting and evaluating dependencies between variables

MOTIVATION: Clustering co-expressed genes usually requires a definition of 'distance' or 'similarity' between measured datasets, the most common choices being the Pearson correlation and the Euclidean distance. With the size of available datasets steadily increasing, it has become feasible to consider other, more general definitions as well. One such alternative, rooted in information theory, is the mutual information, which provides a general measure of dependence between variables. While the use of mutual information for cluster analysis and visualization of large-scale gene expression data has been suggested previously, earlier studies did not focus on comparing different algorithms for estimating the mutual information from finite data.

RESULTS: Here we describe and review several approaches for estimating the mutual information from finite datasets. Our findings show that the algorithms used so far can be substantially improved upon; in particular, when dealing with small datasets, finite-sample effects and other sources of potentially misleading results have to be taken into account.
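To make the quantity concrete, the sketch below shows a naive equal-width binning ('plug-in') estimate of the mutual information, I(X;Y) = Σ p(x,y) log2[ p(x,y) / (p(x)p(y)) ]. This is an illustrative assumption on our part, not one of the specific estimators compared in the paper, and the function name is ours. The shuffled-data line at the end illustrates the finite-sample bias mentioned above: even statistically independent data yield a nonzero estimate.

```python
import numpy as np

def mutual_information(x, y, bins=10):
    """Naive equal-width-binning ('plug-in') estimate of I(X;Y) in bits.

    Discretize both variables, form the joint histogram, and evaluate
    I(X;Y) = sum_ij p_ij * log2( p_ij / (p_i * p_j) ).
    Finite-sample bias is NOT corrected here.
    """
    joint, _, _ = np.histogram2d(x, y, bins=bins)
    p_xy = joint / joint.sum()               # joint probabilities
    p_x = p_xy.sum(axis=1, keepdims=True)    # marginal of X, shape (bins, 1)
    p_y = p_xy.sum(axis=0, keepdims=True)    # marginal of Y, shape (1, bins)
    nz = p_xy > 0                            # avoid log(0)
    return float(np.sum(p_xy[nz] * np.log2(p_xy[nz] / (p_x @ p_y)[nz])))

# Example: a nonlinear dependence that the Pearson correlation largely misses
rng = np.random.default_rng(0)
x = rng.normal(size=5000)
y = x**2 + 0.1 * rng.normal(size=5000)
print("Pearson r :", np.corrcoef(x, y)[0, 1])
print("MI (bits) :", mutual_information(x, y))

# Crude finite-sample check: MI of shuffled (hence independent) data
# does not come out as exactly zero.
print("MI, y shuffled:", mutual_information(x, rng.permutation(y)))
```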
