A dissimilarity measure for the k-Modes clustering algorithm

Clustering, one of the most important data mining techniques, partitions data according to a similarity criterion. Clustering categorical data has recently attracted much attention from the data mining research community. As an extension of the k-Means algorithm, the k-Modes algorithm replaces means with modes and has been widely applied to categorical data clustering. In this paper, the limitations of the simple matching dissimilarity measure and Ng's dissimilarity measure are analyzed using illustrative examples. Based on ideas from biological and genetic taxonomy and on the rough membership function, a new dissimilarity measure for the k-Modes algorithm is defined. A distinctive characteristic of the new measure is that it takes into account the distribution of attribute values over the whole universe. A convergence study and a time complexity analysis of the k-Modes algorithm based on the new dissimilarity measure indicate that it can be used effectively on large data sets. Comparative experiments on synthetic data sets and five real data sets from UCI show the effectiveness of the new dissimilarity measure, especially on data sets with biological and genetic taxonomy information.
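
For orientation, the baseline that the paper sets out to improve can be sketched in a few lines. The following is a minimal illustration of k-Modes with the simple matching dissimilarity only (the number of attributes on which two objects differ); it does not implement Ng's measure or the new measure, neither of which is defined in this abstract. The function names, the tie-breaking rule, and the toy data are assumptions made for illustration.

```python
from collections import Counter


def simple_matching(x, y):
    """Simple matching dissimilarity: number of attributes where x and y differ."""
    return sum(a != b for a, b in zip(x, y))


def mode_of(cluster):
    """Attribute-wise most frequent value of a non-empty list of objects.

    Ties are resolved by Counter.most_common, an arbitrary choice made here.
    """
    return tuple(Counter(column).most_common(1)[0][0] for column in zip(*cluster))


def k_modes(data, initial_modes, max_iter=100):
    """Alternate assignment and mode update until the modes stop changing."""
    modes = [tuple(m) for m in initial_modes]
    labels = []
    for _ in range(max_iter):
        # Assignment step: each object goes to its nearest mode.
        labels = [
            min(range(len(modes)), key=lambda k: simple_matching(x, modes[k]))
            for x in data
        ]
        # Update step: recompute each mode; keep the old mode for an empty cluster.
        new_modes = []
        for k in range(len(modes)):
            members = [x for x, label in zip(data, labels) if label == k]
            new_modes.append(mode_of(members) if members else modes[k])
        if new_modes == modes:
            break
        modes = new_modes
    return labels, modes


# Hypothetical toy data set with three categorical attributes.
data = [
    ("a", "x", "p"), ("a", "x", "q"), ("a", "y", "p"),
    ("b", "z", "r"), ("b", "z", "s"), ("c", "z", "r"),
]
labels, modes = k_modes(data, initial_modes=[data[0], data[3]])
print(labels, modes)
```

Because simple matching treats every mismatch as equally costly, it ignores how attribute values are distributed over the data set, which is the gap the new measure is said to address.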
