Biological and statistical evaluation of clusterings of gene expression profiles

Recent research on large-scale analysis of gene expression data has explored the use of clustering algorithms to find groups of genes with similar expression profiles. It remains unclear, however, how clusterings produced by different algorithms should be compared. We therefore discuss potential methods for evaluating clusterings, including both context-independent statistical methods and methods based on systematic comparison of clusterings with gene annotation. In a comparison study on example expression data, we find that context-independent evaluation methods may give misleading results, since they do not correspond well with annotation-based methods. We compare the context-independent methods (compactness and isolation) with a correlation-based evaluation method. We also propose a method based on computing the relative entropy of each cluster, under the hypothesis that biologically significant clusters should deviate significantly from the background distribution of the data set. This method can be used with any annotation that classifies the genes, including when only some of the genes are annotated. We evaluate the method using enzyme classification as annotation, in a data set where half of the genes are annotated with enzyme classes.
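As a rough illustration of the relative-entropy idea, the sketch below scores one cluster by the Kullback-Leibler divergence between the distribution of annotation classes inside the cluster and the distribution over the whole data set. It is a minimal sketch under stated assumptions, not the procedure used in the paper: the function names are hypothetical, unannotated genes are marked with `None` and simply skipped, the background distribution is taken over annotated genes only, and the cluster is assumed to be a subset of the data set.

```python
import math
from collections import Counter


def class_distribution(labels, classes):
    """Relative frequency of each class in `labels` (all labels must be in `classes`)."""
    counts = Counter(labels)
    total = sum(counts[c] for c in classes)
    return {c: counts[c] / total for c in classes}


def cluster_relative_entropy(cluster_labels, background_labels):
    """KL divergence (in bits) of a cluster's class distribution from the background.

    Assumption: `None` marks an unannotated gene and is ignored, so the score
    can be computed when only some of the genes are annotated.
    Assumption: the cluster is a subset of the data set, so every class seen
    in the cluster has non-zero background frequency.
    """
    background = [x for x in background_labels if x is not None]
    cluster = [x for x in cluster_labels if x is not None]
    if not cluster:
        return 0.0  # no annotated genes: no evidence of deviation
    classes = set(background)
    q = class_distribution(background, classes)  # background distribution
    p = class_distribution(cluster, classes)     # cluster distribution
    # Terms with p[c] == 0 contribute nothing to the sum.
    return sum(p[c] * math.log2(p[c] / q[c]) for c in classes if p[c] > 0)


# Toy data: half of the genes are annotated, with two classes "A" and "B".
background = ["A", "A", "B", "B", None, None, None, None]
pure_cluster = ["A", "A", None]   # only class A among the annotated genes
mixed_cluster = ["A", "B", None]  # same class mix as the background

print(cluster_relative_entropy(pure_cluster, background))
print(cluster_relative_entropy(mixed_cluster, background))
```

In this toy data the cluster containing only one class deviates from the background and receives a positive score, while the cluster that mirrors the background scores zero. A real evaluation would also need to account for cluster size, since small clusters can deviate from the background by chance.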
