Estimating the Optimal Number of Clusters in Categorical Data Clustering by Silhouette Coefficient

Estimating the number of clusters, k, is one of the major challenges in partitional clustering. This paper proposes an algorithm named k-SCC to estimate the optimal k in categorical data clustering. In the clustering step, the algorithm uses a kernel density estimation approach to define cluster centers and an information-theoretic dissimilarity measure to compute the distance between centers and objects in each cluster. Silhouette analysis is then used to evaluate the quality of the clusterings obtained in the clustering step for different values of k and to choose the best k. Experiments on both synthetic and real datasets compare the performance of k-SCC with three other algorithms. The results show that k-SCC outperforms the compared algorithms in determining the number of clusters for each dataset.
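The selection step described above can be illustrated with a minimal sketch. It is not the k-SCC implementation: it assumes a simple-matching (Hamming) dissimilarity in place of the information-theoretic measure, and it scores hand-made candidate labelings rather than clusterings produced by a kernel-density-based clustering step. The names `hamming`, `silhouette`, `choose_k`, and the toy data are hypothetical.

```python
from collections import defaultdict


def hamming(a, b):
    """Simple-matching dissimilarity: fraction of attributes that differ.
    Assumption: stands in for the paper's information-theoretic measure."""
    return sum(x != y for x, y in zip(a, b)) / len(a)


def silhouette(data, labels, dist=hamming):
    """Mean silhouette coefficient of one labeling of categorical objects."""
    clusters = defaultdict(list)
    for idx, lab in enumerate(labels):
        clusters[lab].append(idx)
    if len(clusters) < 2:
        raise ValueError("silhouette needs at least two clusters")
    total = 0.0
    for i, lab in enumerate(labels):
        own = clusters[lab]
        if len(own) == 1:
            continue  # singleton cluster: s(i) = 0 by convention
        # a: mean dissimilarity to the other members of the own cluster
        a = sum(dist(data[i], data[j]) for j in own if j != i) / (len(own) - 1)
        # b: smallest mean dissimilarity to any other cluster
        b = min(
            sum(dist(data[i], data[j]) for j in members) / len(members)
            for other, members in clusters.items()
            if other != lab
        )
        denom = max(a, b)
        total += (b - a) / denom if denom > 0 else 0.0
    return total / len(data)


def choose_k(data, labelings):
    """Return the k whose labeling has the highest mean silhouette."""
    scores = {k: silhouette(data, labels) for k, labels in labelings.items()}
    return max(scores, key=scores.get), scores


# Toy categorical data (hypothetical): two visually obvious groups.
data = [
    ("red", "small", "round"),
    ("red", "small", "oval"),
    ("red", "small", "round"),
    ("blue", "large", "square"),
    ("blue", "large", "square"),
    ("blue", "large", "flat"),
]
# Hand-made candidate clusterings for k = 2 and k = 3 (assumption: in
# practice these would come from running a clustering algorithm per k).
labelings = {2: [0, 0, 0, 1, 1, 1], 3: [0, 0, 0, 1, 1, 2]}

best_k, scores = choose_k(data, labelings)
print(best_k, scores)
```

On this toy data the k = 2 labeling scores higher than the k = 3 labeling, because splitting one object off into a singleton cluster contributes a silhouette of zero for that object.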
