An alternative extension of the k-means algorithm for clustering categorical data

Most of the earlier work on clustering has mainly been focused on numerical data whose inherent geometric properties can be exploited to naturally define distance functions between data points. Recently, the problem of clustering categorical data has started drawing interest. However, the computational cost makes most of the previous algorithms unacceptable for clustering very large databases. The k-means algorithm is well known for its efficiency in this respect. At the same time, working only on numerical data prohibits them from being used for clustering categorical data. The main contribution of this paper is to show how to apply the notion of “cluster centers” on a dataset of categorical objects and how to use this notion for formulating the clustering problem of categorical objects as a partitioning problem. Finally, a k-means-like algorithm for clustering categorical data is introduced. The clustering performance of the algorithm is demonstrated with two well-known data sets, namely, soybean diseaseand nursery databases.

[1]  D. Cox Note on Grouping , 1957 .

[2]  J. MacQueen Some methods for classification and analysis of multivariate observations , 1967 .

[3]  G H Ball,et al.  A clustering technique for summarizing multivariate data. , 1967, Behavioral science.

[4]  Enrique H. Ruspini,et al.  A New Approach to Clustering , 1969, Inf. Control..

[5]  Michael R. Anderberg,et al.  Cluster Analysis for Applications , 1973 .

[6]  James C. Bezdek,et al.  A Convergence Theorem for the Fuzzy ISODATA Clustering Algorithms , 1980, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[7]  Ryszard S. Michalski,et al.  Automated Construction of Classifications: Conceptual Clustering Versus Numerical Taxonomy , 1983, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[8]  Shokri Z. Selim,et al.  K-Means-Type Algorithms: A Generalized Convergence Theorem and Characterization of Local Optimality , 1984, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[9]  James C. Bezdek,et al.  Local convergence of the fuzzy c-Means algorithms , 1986, Pattern Recognit..

[10]  Anil K. Jain,et al.  Algorithms for Clustering Data , 1988 .

[11]  Peter J. Rousseeuw,et al.  Finding Groups in Data: An Introduction to Cluster Analysis , 1990 .

[12]  Michael Spann,et al.  A new approach to clustering , 1990, Pattern Recognit..

[13]  Edwin Diday,et al.  Symbolic clustering using a new dissimilarity measure , 1991, Pattern Recognit..

[14]  H. Ralambondrainy,et al.  A conceptual version of the K-means algorithm , 1995, Pattern Recognit. Lett..

[15]  Christopher J. Merz,et al.  UCI Repository of Machine Learning Databases , 1996 .

[16]  Zhexue Huang,et al.  CLUSTERING LARGE DATA SETS WITH MIXED NUMERIC AND CATEGORICAL VALUES , 1997 .

[17]  Catherine Blake,et al.  UCI Repository of machine learning databases , 1998 .

[18]  Johannes Gehrke,et al.  CACTUS—clustering categorical data using summaries , 1999, KDD '99.

[19]  Michael K. Ng,et al.  A fuzzy k-modes algorithm for clustering categorical data , 1999, IEEE Trans. Fuzzy Syst..

[20]  Sudipto Guha,et al.  ROCK: a robust clustering algorithm for categorical attributes , 1999, Proceedings 15th International Conference on Data Engineering (Cat. No.99CB36337).

[21]  Jon M. Kleinberg,et al.  Clustering categorical data: an approach based on dynamical systems , 2000, The VLDB Journal.

[22]  Jiawei Han,et al.  Data Mining: Concepts and Techniques , 2000 .

[23]  Joshua Zhexue Huang,et al.  Extensions to the k-Means Algorithm for Clustering Large Data Sets with Categorical Values , 1998, Data Mining and Knowledge Discovery.

[24]  Douglas H. Fisher,et al.  Knowledge Acquisition Via Incremental Conceptual Clustering , 1987, Machine Learning.