A Contrast Pattern Based Clustering Quality Index for Categorical Data

Because clustering is unsupervised and highly exploratory, clustering validation (i.e., assessing the quality of clustering solutions) has been an important and long-standing research problem. Existing validity measures have significant shortcomings. This paper proposes a novel Contrast Pattern based Clustering Quality index (CPCQ) for categorical data, which uses the quality and diversity of the contrast patterns (CPs) that distinguish the clusters of a clustering from one another. High-quality CPs can characterize clusters and discriminate them from each other. Experiments show that the CPCQ index (1) recognizes expert-determined classes as the best clusters for many datasets from the UCI repository; (2) does not give inappropriate preference to larger numbers of clusters; and (3) does not require a user to provide a distance function.
