The k-modes Algorithm with Entropy Based Similarity Coefficient

Abstract Clustering is the process of organizing dataset into isolated groups such that data points in the same are more similar and data points of different groups are more dissimilar. The k-modes algorithm well known for its simplicity is a popular partitioning algorithm for clustering categorical data. In this paper, we discuss the limitations of distance function used in this algorithm with an illustrative example and then we propose a similarity coefficient based on Information Entropy. We analyze the time complexity of the k-modes algorithm with proposed similarity coefficient. The main advantage of this coefficient is that it improves the clustering accuracy while retaining scalability of the k-modes algorithm. We perform the scalability tests on synthetic datasets.

[1]  Joshua Zhexue Huang,et al.  Extensions to the k-Means Algorithm for Clustering Large Data Sets with Categorical Values , 1998, Data Mining and Knowledge Discovery.

[2]  Johannes Gehrke,et al.  CACTUS—clustering categorical data using summaries , 1999, KDD '99.

[3]  Claude E. Shannon,et al.  Communication theory of secrecy systems , 1949, Bell Syst. Tech. J..

[4]  Johan A. K. Suykens,et al.  Optimized Data Fusion for Kernel k-Means Clustering , 2012, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[5]  Liang Bai,et al.  A dissimilarity measure for the k-Modes clustering algorithm , 2012, Knowl. Based Syst..

[6]  Michael K. Ng,et al.  On the Impact of Dissimilarity Measure in k-Modes Clustering Algorithm , 2007, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[7]  Sudipto Guha,et al.  ROCK: a robust clustering algorithm for categorical attributes , 1999, Proceedings 15th International Conference on Data Engineering (Cat. No.99CB36337).

[8]  J. MacQueen Some methods for classification and analysis of multivariate observations , 1967 .

[9]  Alfred O. Hero,et al.  Clustering with a new distance measure based on a dual-rooted tree , 2013, Inf. Sci..

[10]  Jiaqi Liu,et al.  A novel clustering method on time series data , 2011, Expert Syst. Appl..

[11]  Weixin Xie,et al.  An Efficient Global K-means Clustering Algorithm , 2011, J. Comput..

[12]  R. Suganya,et al.  Data Mining Concepts and Techniques , 2010 .

[13]  徐晓飞,et al.  Squeezer:An Efficient Algorithm for Clustering Categorical Data , 2002 .

[14]  Dervis Karaboga,et al.  A novel clustering approach: Artificial Bee Colony (ABC) algorithm , 2011, Appl. Soft Comput..

[15]  LiMark Junjie,et al.  On the Impact of Dissimilarity Measure in k-Modes Clustering Algorithm , 2007 .