Initial Seed Selection for Mixed Data Using Modified K-means Clustering Algorithm

Data sets to which clustering is applied may be homogeneous (purely numerical or purely categorical) or heterogeneous (a mix of numerical and categorical attributes), and homogeneous data is easier to handle than heterogeneous data. We propose a novel technique for identifying initial seeds for clustering heterogeneous data. It introduces a distance measure in which the distance over numerical attributes is scaled so that it is comparable to the distance over categorical attributes. The proposed seed selection algorithm ensures that the initial seed points are drawn from different clusters of the clustering solution; these seeds are then given, along with the data set, as input to a modified K-means clustering algorithm. The technique does not depend on any user-defined parameter and can therefore be applied easily to clusterable data sets with mixed attributes. We have also modified the K-means clustering algorithm to handle mixed attributes: numerical attributes are compared using our distance measure, while a categorical attribute contributes one when two values differ and zero when they match. Finally, we compare our approach with existing algorithms to bring out its significance, and we perform a statistical test to evaluate the statistical significance of the proposed technique.
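To make the idea of a scaled mixed-attribute distance concrete, the following is a minimal sketch and not the formula from the paper, which the abstract does not state. It assumes that each numerical attribute is scaled by its observed range (min-max scaling) so that its contribution lies in [0, 1], the same interval as the 0/1 categorical mismatch. The function name `mixed_distance` and the `numeric_ranges` argument are hypothetical, introduced only for illustration.

```python
from typing import Dict, Sequence, Tuple, Union

Value = Union[float, str]


def mixed_distance(
    a: Sequence[Value],
    b: Sequence[Value],
    numeric_ranges: Dict[int, Tuple[float, float]],
) -> float:
    """Distance between two records with mixed attributes.

    numeric_ranges maps the index of each numerical attribute to its
    (min, max) over the data set; every other index is treated as
    categorical. Range scaling is an assumption of this sketch.
    """
    total = 0.0
    for i, (x, y) in enumerate(zip(a, b)):
        if i in numeric_ranges:
            lo, hi = numeric_ranges[i]
            span = hi - lo
            # Scaled numerical difference, bounded by 1 like a categorical mismatch.
            total += abs(x - y) / span if span > 0 else 0.0
        else:
            # Categorical attribute: 1 if the values differ, 0 if they match.
            total += 0.0 if x == y else 1.0
    return total


# Attribute 0 is numerical with observed range [1, 5]; attribute 1 is categorical.
d = mixed_distance((1.0, "red"), (3.0, "blue"), {0: (1.0, 5.0)})
print(d)  # 0.5 from the numerical attribute plus 1.0 from the categorical mismatch
```

Under this assumption no single numerical attribute can outweigh a categorical one, which is the sense of "comparable" used in the abstract; the paper's actual scaling may differ.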
