Model order selection for bio-molecular data clustering

Background: Cluster analysis has been widely applied to investigate structure in bio-molecular data. A drawback of most clustering algorithms is that they cannot automatically detect the "natural" number of clusters underlying the data, and in many cases we do not have enough "a priori" biological knowledge to evaluate either the number of clusters or their validity. Recently, several methods based on the concept of stability have been proposed to estimate the "optimal" number of clusters. Despite their successful application to the analysis of complex bio-molecular data, two major problems remain: assessing the statistical significance of the discovered clustering solutions, and detecting multiple structures simultaneously present in high-dimensional bio-molecular data.

Results: We propose a stability method based on randomized maps that exploits the high dimensionality and relatively low cardinality that characterize bio-molecular data. The method selects subsets of randomized linear combinations of the input variables and uses stability indices based on the overall distribution of similarity measures between multiple pairs of clusterings performed on the randomly projected data. A χ²-based statistical test is proposed to assess the significance of the clustering solutions and to detect significant and, where present, multi-level structures in the data (e.g. hierarchical structures).

Conclusion: The experimental results show that our model order selection methods are competitive with other state-of-the-art stability-based algorithms and are able to detect multiple levels of structure underlying both synthetic and gene expression data.
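The general idea of a stability index computed on randomly projected data can be illustrated with a minimal sketch. Everything specific in it is an assumption made for illustration and is not taken from the paper: the Gaussian random map, the small k-means routine with farthest-first initialisation, the Jaccard coefficient as similarity measure, and the plain mean as stability index. The function names (`random_projection`, `kmeans`, `pair_similarity`, `stability`) are hypothetical, and the χ²-based significance test is not reproduced.

```python
import numpy as np

def random_projection(X, d_out, rng):
    """Project the rows of X to d_out dimensions with a Gaussian random map (assumed choice)."""
    R = rng.standard_normal((X.shape[1], d_out)) / np.sqrt(d_out)
    return X @ R

def kmeans(X, k, rng, n_iter=50):
    """Small Lloyd-style k-means with farthest-first initialisation (illustrative only)."""
    centers = [X[rng.integers(len(X))]]
    for _ in range(1, k):
        # Distance from every point to its nearest already-chosen centre.
        d = np.min([((X - c) ** 2).sum(axis=1) for c in centers], axis=0)
        centers.append(X[int(np.argmax(d))])
    centers = np.array(centers, dtype=float)
    for _ in range(n_iter):
        dist = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dist.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):  # leave an empty cluster's centre unchanged
                centers[j] = X[labels == j].mean(axis=0)
    return labels

def pair_similarity(a, b):
    """Jaccard coefficient over co-clustered pairs of points (label-permutation invariant)."""
    a, b = np.asarray(a), np.asarray(b)
    iu = np.triu_indices(len(a), k=1)
    same_a = (a[:, None] == a[None, :])[iu]
    same_b = (b[:, None] == b[None, :])[iu]
    either = np.sum(same_a | same_b)
    return float(np.sum(same_a & same_b) / either) if either else 1.0

def stability(X, k, d_out, n_pairs=10, seed=0):
    """Mean similarity between pairs of clusterings of independently projected data."""
    rng = np.random.default_rng(seed)
    sims = []
    for _ in range(n_pairs):
        la = kmeans(random_projection(X, d_out, rng), k, rng)
        lb = kmeans(random_projection(X, d_out, rng), k, rng)
        sims.append(pair_similarity(la, lb))
    return float(np.mean(sims))

# Synthetic high-dimensional, low-cardinality data: 3 groups of 20 points in 200 dimensions.
data_rng = np.random.default_rng(42)
group_centers = 8.0 * data_rng.standard_normal((3, 200))
X_demo = np.vstack([c + data_rng.standard_normal((20, 200)) for c in group_centers])

for k in range(2, 6):
    print(k, round(stability(X_demo, k, d_out=20), 3))
```

In a sketch like this, a value of k whose clusterings are reproduced across independent projections yields a stability close to 1, while less reproducible values of k yield lower scores. Comparing the whole distribution of similarity values across k, rather than only its mean, is the part the abstract attributes to the proposed indices and statistical test.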
