Clustering mixed numerical and categorical data with missing values

Abstract: This paper proposes a novel framework for clustering mixed numerical and categorical data with missing values. It integrates the imputation and clustering steps into a single process, resulting in an algorithm named Clustering Mixed Numerical and Categorical Data with Missing Values (k-CMM). The algorithm consists of three phases. The initialization phase splits the input dataset into two parts based on missing values and attribute types. The imputation phase uses a decision-tree-based method to find the set of correlated data objects. The clustering phase uses mean-based and kernel-based methods to form cluster centers for numerical and categorical attributes, respectively. The algorithm also uses the squared Euclidean distance and an information-theoretic dissimilarity measure to compute the distances between objects and cluster centers. An extensive experimental evaluation was conducted on real-life datasets to compare the clustering quality of k-CMM with that of state-of-the-art clustering algorithms. The execution time, memory usage, and scalability of k-CMM were also evaluated for various numbers of clusters and data sizes. Experimental results show that k-CMM can efficiently cluster mixed datasets with missing values and that it outperforms other algorithms as the number of missing values in the datasets increases.
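The clustering phase described in the abstract can be illustrated with a minimal sketch. This is not the k-CMM implementation: the function names, the smoothing parameter `lam`, and the surprisal-based categorical term are all assumptions chosen for illustration. The smoothed frequency estimate stands in for the kernel-based center, and the negative log-probability stands in for the information-theoretic dissimilarity; the abstract does not give either formula.

```python
import math
from collections import Counter


def numeric_center(rows):
    """Mean of each numerical attribute over the cluster's rows."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def categorical_center(values, categories, lam=0.1):
    """Smoothed category distribution for one categorical attribute.

    Assumption: observed frequencies are blended with a uniform
    distribution, a simple stand-in for a kernel-based estimate.
    """
    counts = Counter(values)
    n = len(values)
    k = len(categories)
    return {c: (1 - lam) * counts[c] / n + lam / k for c in categories}


def dissimilarity(num_obj, cat_obj, num_center, cat_centers):
    """Squared Euclidean part plus a surprisal-based categorical part.

    Assumption: -log(p) of the object's category under the center's
    distribution is used as an illustrative information-theoretic term.
    """
    d_num = sum((x - m) ** 2 for x, m in zip(num_obj, num_center))
    d_cat = sum(-math.log(dist[v]) for v, dist in zip(cat_obj, cat_centers))
    return d_num + d_cat


# Hypothetical cluster of four objects: two numerical attributes, one categorical.
cluster_num = [[1.0, 2.0], [3.0, 4.0], [2.0, 3.0], [2.0, 3.0]]
cluster_cat = ["red", "red", "blue", "red"]

num_c = numeric_center(cluster_num)
cat_c = categorical_center(cluster_cat, ["red", "blue", "green"])

d_red = dissimilarity([2.0, 3.0], ["red"], num_c, [cat_c])
d_green = dissimilarity([2.0, 3.0], ["green"], num_c, [cat_c])
```

In this sketch an object whose category is frequent in the cluster (`red`) lies closer to the center than one whose category was never observed there (`green`), while the smoothing keeps the distance finite for unseen categories.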

Van-Nam Huynh | Songsak Sriboonchitta | Duy-Tai Dinh
