A method for design of data-tailored partitioning algorithms for optimizing the number of clusters in microarray analysis

We propose a method for designing a partitioning clustering algorithm from reusable components that is suited to finding the appropriate number of clusters (K) in microarray data. The method is evaluated on 10 datasets (4 synthetic and 6 real-world microarrays) by considering 1008 reusable-component-based algorithms and four normalization methods. The best-performing algorithm is reported for every dataset, and rules are identified for designing microarray-specific clustering algorithms. The results indicate that in the majority of cases a data-tailored clustering algorithm design outperforms the results reported in the literature. In addition, data normalization can have an important influence on algorithm performance. The proposed method gives insights into the design of divisive clustering algorithms that can reveal the optimal K in a microarray dataset.
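
The idea of building candidate algorithms from reusable components can be illustrated with a minimal sketch. The component slots, option names, and the scoring callback below are hypothetical placeholders, since the abstract does not list the actual components or the evaluation measure; the toy grid yields 16 designs, not the 1008 considered in the study. The z-score function stands in for one possible normalization and is not claimed to be one of the four methods used.

```python
from itertools import product
import math

# Hypothetical component slots and options (assumptions, not from the paper).
COMPONENTS = {
    "initialize": ["random", "farthest_first"],
    "distance": ["euclidean", "manhattan"],
    "update": ["mean", "median"],
    "stop_split": ["max_k", "no_improvement"],
}


def enumerate_algorithms(components):
    """Return every design that picks exactly one option per component slot."""
    slots = list(components)
    option_lists = [components[slot] for slot in slots]
    return [dict(zip(slots, combo)) for combo in product(*option_lists)]


def z_score(column):
    """Standardize one feature column to zero mean and unit variance."""
    n = len(column)
    mean = sum(column) / n
    std = math.sqrt(sum((x - mean) ** 2 for x in column) / n)
    if std == 0:
        # A constant column carries no information; map it to zeros.
        return [0.0] * n
    return [(x - mean) / std for x in column]


def select_best(designs, score):
    """Return the design with the highest score.

    `score` is a caller-supplied function (hypothetical here) that would run
    the design on a normalized dataset and return a quality value.
    """
    return max(designs, key=score)


algorithms = enumerate_algorithms(COMPONENTS)
print(len(algorithms))  # 2 * 2 * 2 * 2 = 16 designs in this toy grid
```

In a full experiment, each enumerated design would be run on each normalized dataset and ranked by how well the K it finds matches the known class structure; that evaluation loop is omitted here because its details are not given in the abstract.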
