Information based clustering: Supplementary material

This technical report provides the supplementary material for the paper "Information based clustering", to appear shortly in Proceedings of the National Academy of Sciences (USA). Section I presents in detail the iterative clustering algorithm used in our experiments, and Section II describes the validation scheme used to determine the statistical significance of our results. The subsequent sections provide the full experimental results for three very different applications: the response of gene expression in yeast to different forms of environmental stress, the dynamics of stock prices in the Standard and Poor's 500, and viewer ratings of popular movies. In particular, we highlight some of the results that seem to deserve special attention. All the experimental results and relevant code, including a freely available web application, can be found at this http URL.
