The Performance of Objective Functions for Clustering Categorical Data

Partitioning methods such as k-means are popular and useful for clustering. We recently proposed a new partitioning method for clustering categorical data: using the transfer algorithm to optimize an objective function called within-cluster dispersion. Preliminary experimental results showed that this method outperforms a standard method, k-modes, in terms of the average quality of clustering results. In this paper, we compare the performance of objective functions for categorical data more thoroughly. First, we analytically compare the quality of three objective functions: k-medoids, k-modes, and within-cluster dispersion. Second, we measure how well these objectives recover true structures in real data sets by finding their global optima, which we argue is a better measure than average clustering results. We conclude that within-cluster dispersion is generally a better objective for discovering cluster structures. We also evaluate the performance of various distance measures with within-cluster dispersion and report our observations.
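To make the three objectives concrete, the following is a minimal Python sketch of how each could be evaluated for a fixed partition of categorical records. It rests on assumptions that this abstract does not state: the distance is simple matching (the count of mismatched attributes), and within-cluster dispersion is taken to be the sum of pairwise distances in each cluster divided by the cluster size. The function names are hypothetical, and the paper's exact definitions may differ.

```python
from collections import Counter
from itertools import combinations


def simple_matching(a, b):
    """Number of attributes on which two categorical records differ."""
    return sum(x != y for x, y in zip(a, b))


def k_modes_cost(clusters):
    """Sum of distances from each record to the mode vector of its cluster."""
    total = 0
    for members in clusters:
        # The mode vector takes the most frequent category of each attribute.
        mode = tuple(Counter(col).most_common(1)[0][0] for col in zip(*members))
        total += sum(simple_matching(r, mode) for r in members)
    return total


def k_medoids_cost(clusters):
    """Sum of distances from each record to the best medoid of its cluster."""
    total = 0
    for members in clusters:
        # The medoid is the member that minimizes the total distance to the rest.
        total += min(sum(simple_matching(r, m) for r in members) for m in members)
    return total


def within_cluster_dispersion(clusters):
    """ASSUMED form: per-cluster pairwise distance sum divided by cluster size."""
    total = 0.0
    for members in clusters:
        pair_sum = sum(simple_matching(a, b) for a, b in combinations(members, 2))
        total += pair_sum / len(members)
    return total


if __name__ == "__main__":
    # Hypothetical toy partition: two clusters of two-attribute records.
    example = [
        [("a", "x"), ("a", "y"), ("b", "x")],
        [("c", "z"), ("c", "z")],
    ]
    print(k_modes_cost(example))
    print(k_medoids_cost(example))
    print(within_cluster_dispersion(example))
```

Under these assumptions, the k-modes and k-medoids costs depend on a single representative per cluster, while the dispersion uses every pair of records and needs no representative.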
