Applying subclustering and Lp distance in Weighted K-Means with distributed centroids

We consider the Weighted K-Means algorithm with distributed centroids, which is aimed at clustering data sets with numerical, categorical, and mixed types of data. Our approach allows a given feature (i.e., variable) to have different weights in different clusters, supporting the intuitive idea that a feature may have different degrees of relevance in different clusters. We use the Minkowski metric in such a way that feature weights become feature re-scaling factors for any considered exponent. Moreover, we adapt the traditional Silhouette clustering validity index to deal with both numerical and categorical features. Finally, we show that our new method usually outperforms traditional K-Means as well as the recently proposed WK-DC clustering algorithm.
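
The re-scaling property mentioned above can be illustrated with a minimal sketch. The abstract does not state the distance formula, so the form used here is an assumption: a cluster-specific Minkowski weighted distance in which each weight is raised to the same exponent p as the feature difference, restricted to numerical features. The function names and the sample values are hypothetical and are not taken from the paper.

```python
import math

def minkowski_weighted_distance(x, centroid, weights, p):
    """Assumed form: sum over features of (w_v ** p) * |x_v - c_v| ** p.

    `weights` holds the non-negative feature weights of one cluster.
    The p-th root is omitted, as is common in K-Means-style criteria.
    """
    return sum((w ** p) * abs(a - c) ** p
               for a, c, w in zip(x, centroid, weights))

def rescaled_distance(x, centroid, weights, p):
    """Unweighted Minkowski distance after multiplying each feature by its weight."""
    return sum(abs(w * a - w * c) ** p
               for a, c, w in zip(x, centroid, weights))

# Hypothetical entity, centroid, and per-cluster weights (numerical features only).
x = [1.0, 4.0, 2.0]
centroid = [0.0, 2.0, 5.0]
weights_k = [0.5, 0.3, 0.2]

for p in (1.0, 2.0, 3.5):
    weighted = minkowski_weighted_distance(x, centroid, weights_k, p)
    rescaled = rescaled_distance(x, centroid, weights_k, p)
    # With non-negative weights the two quantities coincide for every exponent.
    assert math.isclose(weighted, rescaled)
```

Under this assumed form, weighting a feature is equivalent to re-scaling it before computing an ordinary Minkowski distance, whichever exponent is chosen. Categorical features and distributed centroids are not covered by this sketch.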
