Clustering mixed numerical and categorical data with missing values

Abstract: This paper proposes a novel framework for clustering mixed numerical and categorical data with missing values. It integrates the imputation and clustering steps into a single process, resulting in an algorithm named Clustering Mixed Numerical and Categorical Data with Missing Values (k-CMM). The algorithm consists of three phases. The initialization phase splits the input dataset into two parts according to missing values and attribute types. The imputation phase uses a decision-tree-based method to find the set of correlated data objects. The clustering phase uses mean and kernel-based methods to form cluster centers for numerical and categorical attributes, respectively. The algorithm also uses squared Euclidean and information-theoretic dissimilarity measures to compute the distances between objects and cluster centers. An extensive experimental evaluation was conducted on real-life datasets to compare the clustering quality of k-CMM with that of state-of-the-art clustering algorithms. The execution time, memory usage, and scalability of k-CMM for various numbers of clusters and data sizes were also evaluated. Experimental results show that k-CMM efficiently clusters mixed datasets with missing values and outperforms the other algorithms as the number of missing values in the datasets increases.
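To make the clustering phase more concrete, the following is a minimal Python sketch of how a cluster center and an object-to-center distance might be computed for mixed data. It is an illustration under stated assumptions, not the k-CMM implementation: the function names, the smoothing parameter `lam`, the Aitchison-Aitken-style kernel, and the use of the negative log-probability as a stand-in for the information-theoretic dissimilarity are all assumptions introduced here. Missing entries (`NaN` for numerical attributes, `None` for categorical ones) are simply skipped, so the decision-tree-based imputation phase is not modeled.

```python
import math
from collections import Counter

import numpy as np


def numeric_center(rows):
    """Mean of each numerical attribute, ignoring NaN (missing) entries."""
    return np.nanmean(np.asarray(rows, dtype=float), axis=0)


def categorical_center(values, categories, lam=0.1):
    """Kernel-smoothed probability of each category for one attribute.

    Assumption: an Aitchison-Aitken-style kernel that gives weight
    (1 - lam) to a matching category and lam / (k - 1) to each other one.
    Needs at least two categories, one observed value, and 0 < lam < 1.
    """
    observed = [v for v in values if v is not None]  # skip missing values
    counts = Counter(observed)
    n, k = len(observed), len(categories)
    center = {}
    for c in categories:
        freq = counts[c] / n
        center[c] = (1.0 - lam) * freq + lam / (k - 1) * (1.0 - freq)
    return center


def mixed_distance(x_num, x_cat, num_center, cat_centers):
    """Squared Euclidean part plus a negative log-probability part.

    Assumption: -log p(value) is a simple information-theoretic
    stand-in, not the dissimilarity measure defined in the paper.
    """
    d = 0.0
    for v, m in zip(x_num, num_center):
        if not math.isnan(v):  # skip missing numerical entries
            d += (v - m) ** 2
    for v, center in zip(x_cat, cat_centers):
        if v is not None:  # skip missing categorical entries
            d += -math.log(center[v])
    return d


# Hypothetical toy cluster: two numerical attributes, one categorical.
num_c = numeric_center([[1.0, 2.0], [3.0, float("nan")]])
cat_c = categorical_center(["a", "a", "b", None], ["a", "b", "c"])
dist = mixed_distance([1.0, float("nan")], ["a"], num_c, [cat_c])
```

Because `lam` is strictly positive, every category keeps a nonzero probability in the center, so the logarithm in `mixed_distance` is always defined, even for a category that never occurs in the cluster.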
