A Comparative Study of Cluster Validation Indices Applied to Genotyping Data

Abstract: Clustering is a central task in unsupervised learning, and cluster validation plays an important role in cluster analysis. In this paper, we compare the performance of seven major validation indices designed for fuzzy c-means on genotyping data obtained from single nucleotide polymorphism analysis: Partition Coefficient (PC), Partition Entropy (PE), Fukuyama-Sugeno index (F-S), Xie and Beni index (X-B), Compose Within and Between scattering (CWB), SC, and Fuzzy Hypervolume (FHV). We first identify three factors that may influence the indices' performance: the fuzzy factor m, the number of variables p, and the maximum number of clusters c_max. A validation scheme was then designed to optimize the performance of these indices. Finally, we tested the indices on a total of 18 datasets and compared their performance. PC and CWB showed the best overall performance: CWB failed on only one dataset and PC on two.
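
As an illustration of the two simplest indices named above, the following minimal sketch computes PC and PE from a fuzzy membership matrix using their standard textbook definitions, PC = (1/n) Σ u_ik² and PE = -(1/n) Σ u_ik ln u_ik. The function names and the toy membership matrices are assumptions made for this example; they are not taken from the paper, and the paper's own validation scheme is not reproduced here.

```python
import math

def partition_coefficient(u):
    """PC for a membership matrix u (one row per sample, rows sum to 1).

    Ranges from 1/c (maximally fuzzy) to 1 (crisp partition).
    """
    n = len(u)
    return sum(x * x for row in u for x in row) / n

def partition_entropy(u):
    """PE for a membership matrix u, using the natural logarithm.

    Ranges from 0 (crisp partition) to ln(c) (maximally fuzzy).
    Zero memberships are skipped, following the convention 0 * ln(0) = 0.
    """
    n = len(u)
    return -sum(x * math.log(x) for row in u for x in row if x > 0) / n

# Hypothetical toy partitions of two samples into two clusters.
crisp = [[1.0, 0.0], [0.0, 1.0]]
fuzzy = [[0.5, 0.5], [0.5, 0.5]]

print(partition_coefficient(crisp), partition_entropy(crisp))
print(partition_coefficient(fuzzy), partition_entropy(fuzzy))
```

In use, one would typically run fuzzy c-means for each candidate number of clusters up to c_max and prefer the partition with the highest PC or the lowest PE.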
