Statistical estimation of cluster boundaries in gene expression profile data

MOTIVATION: Gene expression profile data are accumulating rapidly owing to advances in microarray techniques. These data are analyzed by clustering procedures to extract useful information about the genes. In clustering analyses, determining the boundaries of gene clusters systematically, rather than by visual inspection and biological knowledge, remains challenging.

RESULTS: We propose a statistical procedure to estimate the number of clusters in the hierarchical clustering of expression profiles. Following the hierarchical clustering, the statistical property of the profiles at each node of the dendrogram is evaluated by a statistical measure: the variance inflation factor from multiple regression analysis. This evaluation leads to an automatic determination of the cluster boundaries without any additional analyses or biological knowledge of the measured genes. The performance of the procedure is demonstrated on the profiles of 2467 yeast genes, with promising results.

AVAILABILITY: A set of programs will be sent electronically upon request.

CONTACT: horimoto@post.saga-med.ac.jp; toh@beri.co.jp
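
As a rough illustration of the variance inflation factor (VIF) named under RESULTS, the sketch below computes the VIF of each profile in a group from the inverse of the profiles' correlation matrix, using the standard identity VIF_j = 1 / (1 - R_j^2), where R_j^2 is the coefficient of determination obtained when profile j is regressed on the other profiles. The abstract does not state how the VIF is applied at each dendrogram node or what cutoff is used, so the function name `variance_inflation_factors`, the synthetic data, and the input layout (rows as profiles, columns as measurement conditions) are assumptions made only for this example, not the authors' implementation.

```python
import numpy as np


def variance_inflation_factors(profiles):
    """Return the VIF of each row of `profiles` (hypothetical helper).

    profiles: array of shape (n_profiles, n_conditions), with more
    conditions than profiles so the correlation matrix is invertible.
    The VIF of profile j equals the j-th diagonal element of the inverse
    correlation matrix, i.e. 1 / (1 - R_j^2).
    """
    corr = np.corrcoef(profiles)          # rows are treated as variables
    return np.diag(np.linalg.inv(corr))


# Synthetic data (assumption): three unrelated profiles over 50 conditions.
rng = np.random.default_rng(0)
independent = rng.normal(size=(3, 50))
vif_independent = variance_inflation_factors(independent)

# Append a near-copy of the first profile to create strong collinearity.
near_copy = independent[0] + 0.01 * rng.normal(size=50)
collinear = np.vstack([independent, near_copy])
vif_collinear = variance_inflation_factors(collinear)

print("VIF, unrelated profiles:", np.round(vif_independent, 2))
print("VIF, with a near-duplicate profile:", np.round(vif_collinear, 2))
```

In this toy setting the unrelated profiles give VIF values close to 1, while the first profile and its near-copy give very large values, which is the kind of signal of redundancy among profiles that a VIF-based criterion can exploit.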
