Penalty-based cluster validity index for class discovery from cancer data

Successful diagnosis and treatment of cancer depend on discovering and classifying cancer types correctly. One of the challenges in cancer class discovery is estimating the number of classes in a set of microarray data whose class labels are unknown. In this paper, we propose a new cluster validity criterion, the Penalty-based Disagreement Index (PDI), which uses a perturbation technique to estimate the number of classes in microarray data. PDI considers not only the disagreement between the partition obtained from the original data and the partitions obtained from the perturbed data, but also a penalty measure that is a function of the number of classes. Our experiments show that PDI successfully estimates the true number of classes in several challenging real cancer datasets.
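The abstract does not give the PDI formula, so the following is only a minimal sketch of the general idea it describes: cluster the original data, cluster several perturbed copies, measure how much the partitions disagree, and add a penalty that depends on the number of classes. Everything specific here is an assumption made for illustration, not the authors' definition: the plain k-means clusterer, the Gaussian-noise perturbation, the pairwise co-membership disagreement measure, the placeholder linear penalty, and all function and parameter names.

```python
import numpy as np


def kmeans(X, k, rng, n_iter=50):
    """Plain Lloyd's k-means; a stand-in for whatever clusterer is used."""
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Squared distance from every point to every center.
        d = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(axis=0)
    return labels


def pair_disagreement(a, b):
    """Fraction of sample pairs grouped together in one partition but not the other."""
    same_a = a[:, None] == a[None, :]
    same_b = b[:, None] == b[None, :]
    iu = np.triu_indices(len(a), k=1)  # each unordered pair once
    return float(np.mean(same_a[iu] != same_b[iu]))


def penalized_disagreement(X, k, n_perturb=10, noise=0.1,
                           penalty=lambda k: 0.02 * k, seed=0):
    """Mean disagreement over perturbed copies plus a penalty on k.

    The penalty is a hypothetical placeholder; the paper's actual
    penalty function is not stated in the abstract.
    """
    rng = np.random.default_rng(seed)
    base = kmeans(X, k, rng)
    scores = []
    for _ in range(n_perturb):
        # Assumed perturbation: additive Gaussian noise scaled to the data.
        Xp = X + rng.normal(scale=noise * X.std(), size=X.shape)
        scores.append(pair_disagreement(base, kmeans(Xp, k, rng)))
    return float(np.mean(scores)) + penalty(k)


if __name__ == "__main__":
    rng = np.random.default_rng(42)
    # Synthetic stand-in for expression profiles: three separated groups.
    X = np.vstack([rng.normal(c, 0.3, size=(20, 5)) for c in (0.0, 3.0, 6.0)])
    for k in range(2, 6):
        print(k, round(penalized_disagreement(X, k), 4))
```

In this sketch the candidate with the lowest score would be taken as the estimated number of classes. The search starts at k = 2 because a single cluster trivially shows zero disagreement under any perturbation.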
