k-CCM: A Center-Based Algorithm for Clustering Categorical Data with Missing Values

This paper addresses the problem of clustering categorical data with missing values. Specifically, we design a new framework that imputes missing values and assigns objects to appropriate clusters. For the imputation step, we use a decision tree-based method to fill in missing values. For the clustering step, we use a kernel density estimation approach to define cluster centers and an information theoretic-based dissimilarity measure to quantify the differences between objects. Building on these components, we propose a center-based algorithm for clustering categorical data with missing values, namely k-CCM. We evaluate the proposed algorithm experimentally on real-life datasets with missing values and compare its clustering quality with that of other popular clustering algorithms. Overall, the experimental results show that the proposed algorithm achieves performance comparable to that of the other algorithms across all datasets.
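To make the clustering step more concrete, the following is a minimal sketch of a kernel-smoothed categorical cluster center and an information-theoretic dissimilarity between an object and such a center. It is an illustration under stated assumptions, not the authors' implementation: it assumes an Aitchison-Aitken-style smoothing with a fixed bandwidth `lam`, a Lin-style similarity between category values, and data that has already been imputed. The function names (`smoothed_center`, `value_dissimilarity`, `object_to_center`) and the toy data are hypothetical.

```python
import math
from collections import Counter


def smoothed_center(cluster, domains, lam=0.2):
    """Return one probability distribution per attribute for a cluster.

    Assumption: the relative frequency of each category is mixed with a
    uniform distribution using a fixed bandwidth `lam`.
    """
    n = len(cluster)
    center = []
    for d, domain in enumerate(domains):
        counts = Counter(row[d] for row in cluster)
        center.append(
            {v: lam / len(domain) + (1 - lam) * counts[v] / n for v in domain}
        )
    return center


def value_dissimilarity(a, b, freq):
    """Lin-style dissimilarity between two values of one attribute.

    `freq` maps each observed value to its relative frequency in the
    whole dataset, so every value passed in has a nonzero frequency.
    """
    if a == b:
        return 0.0
    sim = 2 * math.log(freq[a] + freq[b]) / (math.log(freq[a]) + math.log(freq[b]))
    return 1.0 - sim


def object_to_center(row, center, freqs):
    """Expected value dissimilarity between an object and a center,
    summed over attributes and weighted by the center's probabilities."""
    return sum(
        p * value_dissimilarity(row[d], v, freqs[d])
        for d, dist in enumerate(center)
        for v, p in dist.items()
    )


# Hypothetical toy data: (colour, size), already complete (no missing values).
data = [
    ("red", "small"), ("red", "small"), ("red", "large"),
    ("blue", "large"), ("blue", "large"), ("green", "large"),
]
n_attrs = 2
domains = [sorted({row[d] for row in data}) for d in range(n_attrs)]
freqs = [
    {v: c / len(data) for v, c in Counter(row[d] for row in data).items()}
    for d in range(n_attrs)
]

center_a = smoothed_center(data[:3], domains)
center_b = smoothed_center(data[3:], domains)
dist_a = object_to_center(data[0], center_a, freqs)
dist_b = object_to_center(data[0], center_b, freqs)
```

In a k-means-like loop, each object would be assigned to the center with the smallest such dissimilarity and the centers recomputed until the assignments stop changing. The decision tree-based imputation step is deliberately left out of this sketch.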
