Clustering Gene Expression Data by Mutual Information with Gene Function

We introduce a simple on-line algorithm for clustering paired samples of continuous and discrete data. The clusters are defined in the continuous data space and become local there, while within-cluster differences between the associated, implicitly estimated conditional distributions of the discrete variable are minimized. The discrete variable can be seen as an indicator of relevance or importance that guides the clustering. Minimizing the Kullback-Leibler divergence-based distortion criterion is equivalent to maximizing the mutual information between the generated clusters and the discrete variable. We apply the method to a time series data set, namely yeast gene expression measured with DNA chips, with biological knowledge about gene function encoded into the discrete variable.
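
The stated equivalence between minimizing the Kullback-Leibler distortion and maximizing mutual information can be checked numerically in a simplified setting. The sketch below is not the on-line algorithm of the paper: it assumes hard cluster assignments are already given, that each sample carries exactly one discrete label (so its conditional distribution is one-hot), and that each cluster prototype is the empirical label distribution of its members. The function names (`joint_counts`, `mutual_information`, `kl_distortion`, `label_entropy`) are hypothetical and chosen for illustration. Under these assumptions the average distortion equals H(labels) minus I(clusters; labels), and since H(labels) does not depend on the clustering, lowering the distortion raises the mutual information.

```python
import numpy as np


def joint_counts(clusters, labels):
    """Contingency table: counts[j, c] = number of samples in cluster j with label c."""
    clusters = np.asarray(clusters)
    labels = np.asarray(labels)
    counts = np.zeros((clusters.max() + 1, labels.max() + 1))
    np.add.at(counts, (clusters, labels), 1.0)
    return counts


def mutual_information(clusters, labels):
    """Empirical mutual information (in nats) between cluster index and label."""
    joint = joint_counts(clusters, labels)
    joint /= joint.sum()
    p_cluster = joint.sum(axis=1, keepdims=True)
    p_label = joint.sum(axis=0, keepdims=True)
    product = p_cluster @ p_label
    nonzero = joint > 0  # skip empty cells, where 0 * log 0 is taken as 0
    return float(np.sum(joint[nonzero] * np.log(joint[nonzero] / product[nonzero])))


def kl_distortion(clusters, labels):
    """Average KL divergence from each sample's one-hot label distribution
    to the empirical label distribution of the cluster it belongs to."""
    clusters = np.asarray(clusters)
    labels = np.asarray(labels)
    counts = joint_counts(clusters, labels)
    row_sums = np.maximum(counts.sum(axis=1, keepdims=True), 1.0)  # guard empty clusters
    prototypes = counts / row_sums  # estimate of p(label | cluster)
    # KL(one-hot at c || prototype) reduces to -log prototype[c]
    return float(-np.mean(np.log(prototypes[clusters, labels])))


def label_entropy(labels):
    """Empirical entropy (in nats) of the discrete variable."""
    p = np.bincount(np.asarray(labels)) / len(labels)
    p = p[p > 0]
    return float(-np.sum(p * np.log(p)))


if __name__ == "__main__":
    # Toy data (made up): six samples, three clusters, two label values.
    clusters = [0, 0, 1, 1, 2, 2]
    labels = [0, 0, 1, 1, 0, 1]
    print("distortion       :", kl_distortion(clusters, labels))
    print("H(labels) - I    :", label_entropy(labels) - mutual_information(clusters, labels))
```

In the paper's setting the clusters are defined in the continuous expression space and the conditional distributions are estimated implicitly, so this check only illustrates the identity behind the criterion, not how the clusters are found.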
