CL.E.KMODES: a modified k-modes clustering algorithm

In this paper we present CL.E.KMODES, a new method for clustering categorical data sets. The proposed method is a modified k-modes algorithm that incorporates a new four-step dissimilarity measure based on elements of the methodological framework of the ELECTRE I multicriteria method. The four-step dissimilarity measure introduces an alternative and more accurate way of assigning objects to clusters. In particular, it compares each object with each mode on every attribute that they have in common, and then chooses the most appropriate mode, and its corresponding cluster, for that object. The method is tested on seven widely used data sets, and its robustness is assessed with six clustering evaluation measures.
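For orientation, the assignment step that the proposed measure modifies can be sketched as follows. This is a minimal sketch of the standard k-modes loop with the simple matching dissimilarity, not the four-step ELECTRE I-based measure, whose details the abstract does not give. All function names (`simple_matching`, `assign`, `update_modes`, `k_modes`) are hypothetical and chosen for illustration. The `dissimilarity` parameter marks the point where an alternative measure would plug in.

```python
from collections import Counter


def simple_matching(obj, mode):
    """Baseline dissimilarity: number of attributes on which obj and mode differ."""
    return sum(a != b for a, b in zip(obj, mode))


def assign(objects, modes, dissimilarity=simple_matching):
    """Compare each object with each mode and pick the least dissimilar mode."""
    return [
        min(range(len(modes)), key=lambda k: dissimilarity(obj, modes[k]))
        for obj in objects
    ]


def update_modes(objects, labels, old_modes):
    """Recompute each mode as the most frequent category of every attribute."""
    new_modes = []
    for k, old_mode in enumerate(old_modes):
        members = [o for o, label in zip(objects, labels) if label == k]
        if not members:
            # Assumption: an empty cluster simply keeps its previous mode.
            new_modes.append(old_mode)
            continue
        new_modes.append(
            tuple(Counter(column).most_common(1)[0][0] for column in zip(*members))
        )
    return new_modes


def k_modes(objects, initial_modes, dissimilarity=simple_matching, max_iter=100):
    """Alternate assignment and mode update until the labels stop changing."""
    modes = list(initial_modes)
    labels = None
    for _ in range(max_iter):
        new_labels = assign(objects, modes, dissimilarity)
        if new_labels == labels:
            break
        labels = new_labels
        modes = update_modes(objects, labels, modes)
    return labels, modes


if __name__ == "__main__":
    # Toy categorical data, invented for illustration only.
    data = [("a", "x"), ("a", "x"), ("a", "y"), ("b", "z"), ("b", "z")]
    labels, modes = k_modes(data, initial_modes=[("a", "x"), ("b", "z")])
    print(labels, modes)
```

Passing a different callable as `dissimilarity` changes only how objects are matched to modes, which is the part of k-modes that the abstract says the proposed measure replaces.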
