The "Best K" for Entropy-based Categorical Data Clustering

With the growing demand for cluster analysis of categorical data, a handful of categorical clustering algorithms have been developed. Surprisingly, to our knowledge, none has satisfactorily addressed an important problem in categorical clustering: how can we determine the best number of clusters K for a categorical dataset? Because categorical data has no inherent distance function to serve as a similarity measure, traditional cluster validation techniques based on geometric shape and density distribution cannot be applied to answer this question. In this paper, we investigate the entropy property of categorical data and propose the BkPlot method for determining a set of candidate “best Ks”. The method is implemented with a hierarchical clustering algorithm, HierEntro. The experimental results show that our approach can effectively identify the significant clustering structures.

Keywords: Categorical Data Clustering, Entropy, Cluster Validation
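To make the entropy property concrete, the following is a minimal sketch of how the entropy of a categorical dataset and the expected entropy of a given partition might be computed. It is not the BkPlot or HierEntro procedure, whose details this abstract does not give. The function names `dataset_entropy` and `expected_entropy` are hypothetical, and summing per-attribute entropies is an assumption that treats attributes as independent.

```python
from collections import Counter
from math import log2


def dataset_entropy(rows):
    """Sum of per-attribute Shannon entropies, in bits.

    Assumption: attributes are treated as independent, so the entropy
    of the dataset is approximated by the sum of column entropies.
    `rows` is a list of equal-length tuples of categorical values.
    """
    if not rows:
        return 0.0
    n = len(rows)
    total = 0.0
    for column in zip(*rows):
        counts = Counter(column)
        total -= sum((c / n) * log2(c / n) for c in counts.values())
    return total


def expected_entropy(rows, labels):
    """Size-weighted average of the cluster entropies for a partition.

    `labels[i]` is the cluster label assigned to `rows[i]`.
    """
    clusters = {}
    for row, label in zip(rows, labels):
        clusters.setdefault(label, []).append(row)
    n = len(rows)
    return sum(
        (len(members) / n) * dataset_entropy(members)
        for members in clusters.values()
    )


# Hypothetical toy data: two attributes, two obvious groups.
rows = [("a", "x"), ("a", "x"), ("b", "y"), ("b", "y")]
good_partition = [0, 0, 1, 1]   # groups identical records together
poor_partition = [0, 1, 0, 1]   # mixes the two groups in each cluster
```

On this toy data, a partition that groups identical records has a lower expected entropy than one that mixes them, which is the kind of signal an entropy-based criterion can use when comparing clusterings with different K.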
