Evaluation of clustering algorithms for gene expression data

Background: Cluster analysis is an integral part of high-dimensional data analysis. In the context of large-scale gene expression data, a filtered set of genes is grouped according to expression profiles using one of the numerous clustering algorithms in the statistics and machine learning literature. A closely related problem is selecting, from the many algorithms available, one that is "optimal" in some sense.

Results: In this paper, we propose two validation measures, each with two parts: one measuring the statistical consistency (stability) of the clusters produced and the other representing their biological functional congruence. Smaller values of these indices indicate better performance for a clustering algorithm. We illustrate this approach using two case studies with publicly available gene expression data sets: one involving SAGE data from breast cancer patients and the other involving time-course cDNA microarray data on yeast. Six well-known clustering algorithms (UPGMA, K-Means, Diana, Fanny, Model-Based and SOM) were evaluated.

Conclusion: No single clustering algorithm may be best suited for clustering genes into functional groups via expression profiles for all data sets. The validation measures introduced in this paper can aid in the selection of an optimal algorithm, for a given data set, from a collection of available clustering algorithms.
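
The abstract does not spell out how the indices are computed, so the following is only an illustrative sketch of one stability-type measure in the same spirit (smaller is better). It assumes that the same genes have been clustered twice, once on the full data and once on a perturbed version (for example, with one experimental condition removed), and it reports the average proportion of each gene's original cluster that is not preserved. The function name `average_non_overlap` and its inputs are hypothetical and are not taken from the paper.

```python
from collections import defaultdict


def average_non_overlap(labels_full, labels_reduced):
    """Average proportion of non-overlap between two clusterings.

    labels_full[i] is the cluster label of gene i on the full data;
    labels_reduced[i] is its label on the perturbed data (an assumption
    of this sketch). Returns a value in [0, 1); 0 means every gene keeps
    exactly the same cluster mates.
    """
    if len(labels_full) != len(labels_reduced):
        raise ValueError("both clusterings must label the same genes")
    if not labels_full:
        raise ValueError("at least one gene is required")

    # Collect the member genes of every cluster in each clustering.
    members_full = defaultdict(set)
    members_reduced = defaultdict(set)
    for gene, (a, b) in enumerate(zip(labels_full, labels_reduced)):
        members_full[a].add(gene)
        members_reduced[b].add(gene)

    # For each gene, measure how much of its full-data cluster is lost
    # from the cluster that contains it after the perturbation.
    total = 0.0
    for a, b in zip(labels_full, labels_reduced):
        full_cluster = members_full[a]
        overlap = len(full_cluster & members_reduced[b])
        total += 1.0 - overlap / len(full_cluster)
    return total / len(labels_full)
```

Because the measure compares cluster memberships rather than label values, relabelled but otherwise identical clusterings score 0. In practice one would average this quantity over several perturbations and compare the result across algorithms at a fixed number of clusters.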
