Evaluation of gene-expression clustering via mutual information distance measure

Background
The choice of distance measure plays a key role in evaluating different clustering solutions of gene expression profiles. In this empirical study we compare clustering solutions evaluated with the Mutual Information (MI) measure against those evaluated with the well-known Euclidean distance and Pearson correlation coefficient.

Results
Relying on several public gene expression datasets, we evaluate the homogeneity and separation scores of different clustering solutions. We find that the MI measure differentiates more significantly among erroneous clustering solutions. The proposed measure was also used to analyze the performance of several known clustering algorithms. A comparative study of these algorithms reveals that their "best solutions" are ranked almost oppositely under different distance measures, even though the measures correspond to one another when the scores are averaged over groups of solutions.

Conclusion
In view of these results, further attention should be paid to selecting a proper distance measure for analyzing the clustering of gene expression data.
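
To make the comparison of measures concrete, the following is a minimal sketch of how an MI-based distance between two expression profiles might be computed next to Euclidean distance and Pearson correlation. The abstract does not specify the estimator, so everything here is an assumption: equal-width binning with Sturges' rule for the number of bins, the normalization d = 1 - MI / max(H(X), H(Y)), and the function names `mi_distance` and `pearson`, which are hypothetical.

```python
import math
from collections import Counter


def _bin_indices(values, n_bins):
    """Map each value to an equal-width bin index in [0, n_bins - 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:  # constant profile: everything falls in one bin
        return [0] * len(values)
    width = (hi - lo) / n_bins
    return [min(int((v - lo) / width), n_bins - 1) for v in values]


def _entropy(counts, n):
    """Shannon entropy (in bits) of an empirical distribution given as counts."""
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def mi_distance(x, y, n_bins=None):
    """Assumed normalized MI distance: 1 - MI(X;Y) / max(H(X), H(Y))."""
    n = len(x)
    if n_bins is None:
        # Assumption: Sturges' rule for the number of bins.
        n_bins = 1 + math.ceil(math.log2(n))
    bx = _bin_indices(x, n_bins)
    by = _bin_indices(y, n_bins)
    hx = _entropy(Counter(bx), n)
    hy = _entropy(Counter(by), n)
    hxy = _entropy(Counter(zip(bx, by)), n)
    mi = hx + hy - hxy  # MI(X;Y) = H(X) + H(Y) - H(X,Y)
    h_max = max(hx, hy)
    if h_max == 0:  # both profiles constant: no information to share
        return 1.0
    return 1.0 - mi / h_max


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


# Toy profiles: y depends on x, but not linearly.
x = [i / 10 for i in range(-20, 21)]
y = [v * v for v in x]

print("Euclidean distance:", math.dist(x, y))
print("Pearson correlation:", pearson(x, y))
print("MI distance:", mi_distance(x, y))
```

On these toy profiles the Pearson correlation is essentially zero although `y` is fully determined by `x`, while the MI distance falls below 1, which illustrates the kind of non-linear dependence an MI-based measure can register. Binned MI estimates are sensitive to the number of bins and to short profiles, so the values should be read as illustrative only.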
