CPCQ: Contrast pattern based clustering quality index for categorical data

Clustering validation is concerned with assessing the quality of clustering solutions. Because clustering is unsupervised and highly exploratory, clustering validation has been an important and long-standing research problem. Existing validity measures, including entropy-based and distance-based indices, have significant shortcomings. For many datasets from the UCI repository, they fail to recognize that the expert-determined classes are the best clusters, and they frequently prefer clusterings with a larger number of clusters. These weaknesses reflect their inability to accurately capture intra-cluster coherence and inter-cluster separation. This paper proposes a novel Contrast Pattern based Clustering Quality index (CPCQ) for categorical data, which utilizes the quality and diversity of the contrast patterns that contrast the clusters in a given clustering. High-quality contrast patterns can serve to characterize the clusters and to discriminate one cluster from the others. The CPCQ index is based on the rationale that a high-quality clustering should have many diversified high-quality contrast patterns among its clusters. The quality of an individual contrast pattern is defined in terms of its length, its support, and the length of its corresponding closed pattern. The quality measure concerning "many diversified" contrast patterns is defined in terms of the quality and diversity of selected groups of contrast patterns, chosen so that overlap among contrast patterns and among groups is minimal in terms of both items and matching transactions. Experiments show that the CPCQ index (1) does not require a user to provide a distance function; (2) does not give inappropriate preference to a larger number of clusters; and (3) can recognize that expert-determined classes are the best clusters for many datasets from the UCI repository.
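To make the notion of a contrast pattern concrete, the following minimal Python sketch computes the support of an item set in each cluster and tests whether it contrasts one cluster against the others. It is an illustration only and does not implement the CPCQ index: the function names, the toy data, the `min_support` threshold, and the requirement of zero support in the other clusters are all assumptions made for this example, not definitions taken from the paper.

```python
from typing import FrozenSet, List

# A transaction is a set of "attribute=value" items (illustrative encoding).
Transaction = FrozenSet[str]


def support(pattern: FrozenSet[str], cluster: List[Transaction]) -> float:
    """Fraction of transactions in `cluster` that contain every item of `pattern`."""
    if not cluster:
        return 0.0
    return sum(1 for t in cluster if pattern <= t) / len(cluster)


def is_contrast_pattern(
    pattern: FrozenSet[str],
    target: List[Transaction],
    others: List[List[Transaction]],
    min_support: float = 0.5,  # assumed threshold, chosen for this example only
) -> bool:
    """True if `pattern` is frequent in `target` and matches nothing in the other clusters.

    Zero support elsewhere is a simplifying assumption for this sketch.
    """
    frequent_in_target = support(pattern, target) >= min_support
    absent_elsewhere = all(support(pattern, c) == 0.0 for c in others)
    return frequent_in_target and absent_elsewhere


# Toy clustering with two clusters (hypothetical data).
cluster_a = [
    frozenset({"color=red", "shape=round"}),
    frozenset({"color=red", "shape=round"}),
    frozenset({"color=red", "shape=square"}),
]
cluster_b = [
    frozenset({"color=blue", "shape=square"}),
    frozenset({"color=blue", "shape=round"}),
]

red = frozenset({"color=red"})
round_shape = frozenset({"shape=round"})

# "color=red" occurs in every transaction of cluster_a and in none of cluster_b.
print(is_contrast_pattern(red, cluster_a, [cluster_b]))
# "shape=round" occurs in both clusters, so it does not contrast them.
print(is_contrast_pattern(round_shape, cluster_a, [cluster_b]))
```

In this toy clustering, `color=red` both characterizes the first cluster and separates it from the second, whereas `shape=round` does neither. A clustering whose clusters admit many such patterns, drawing on different items and matching different transactions, is the kind the rationale above would regard as high quality.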
