k-PbC: an improved cluster center initialization for categorical data clustering

The performance of a partitional clustering algorithm depends on the initial, randomly chosen cluster centers, so different runs on the same data set often yield different results. This paper addresses that problem with k-PbC, an algorithm that replaces random initialization with a pattern-mining-based one to improve clustering quality. Specifically, k-PbC first mines maximal frequent itemsets to obtain a set of initial clusters. It then uses a kernel-based method to form cluster centers and an information-theoretic dissimilarity measure to estimate the distance between cluster centers and data objects. An extensive experimental study on various real categorical data sets compares k-PbC with state-of-the-art categorical clustering algorithms in terms of clustering quality. The results show that the proposed initialization method improves clustering results and that k-PbC outperforms the compared algorithms on both internal and external validation metrics.
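To make the initialization idea concrete, the following minimal sketch shows how maximal frequent itemsets could seed initial clusters on a toy categorical data set. It is an illustration under stated assumptions, not the k-PbC implementation: a brute-force enumeration stands in for whatever maximal frequent itemset miner the paper uses, and the function names (`frequent_itemsets`, `maximal_itemsets`, `seed_clusters`), the support threshold, the ranking of patterns, and the toy data are all hypothetical. The kernel-based centers and the information-theoretic dissimilarity are not shown.

```python
from itertools import combinations


def to_transactions(rows):
    # Encode each object as a set of (attribute index, value) items.
    return [frozenset(enumerate(row)) for row in rows]


def frequent_itemsets(rows, min_support):
    """Brute-force frequent itemset enumeration (assumption: fine for toy data only)."""
    transactions = to_transactions(rows)
    items = sorted(set().union(*transactions))
    frequent = {}
    for size in range(1, len(rows[0]) + 1):
        found = False
        for candidate in combinations(items, size):
            # An object holds one value per attribute, so skip itemsets
            # that assign two values to the same attribute.
            if len({attr for attr, _ in candidate}) < size:
                continue
            itemset = frozenset(candidate)
            support = sum(itemset <= t for t in transactions)
            if support >= min_support:
                frequent[itemset] = support
                found = True
        if not found:
            # No frequent itemset of this size, so none of a larger size exists.
            break
    return frequent


def maximal_itemsets(frequent):
    # A frequent itemset is maximal if no proper superset is also frequent.
    return [s for s in frequent if not any(s < other for other in frequent)]


def seed_clusters(rows, min_support, k):
    """Pick k maximal patterns and group the objects that contain each one.

    The ranking (longer pattern first, then higher support) is a hypothetical
    heuristic chosen for this sketch.
    """
    frequent = frequent_itemsets(rows, min_support)
    maximal = maximal_itemsets(frequent)
    maximal.sort(key=lambda s: (-len(s), -frequent[s], sorted(s)))
    transactions = to_transactions(rows)
    seeds = []
    for pattern in maximal[:k]:
        members = [i for i, t in enumerate(transactions) if pattern <= t]
        seeds.append((dict(pattern), members))
    return seeds


# Toy data: three categorical attributes (color, size, shape).
rows = [
    ("red", "small", "round"),
    ("red", "small", "round"),
    ("red", "small", "square"),
    ("blue", "large", "square"),
    ("blue", "large", "square"),
    ("blue", "large", "round"),
]
seeds = seed_clusters(rows, min_support=2, k=2)
for pattern, members in seeds:
    print(pattern, members)
```

In this sketch each seed is a pattern plus the indices of the objects that contain it. Objects covered by no selected pattern (rows 2 and 5 here) stay unassigned and would have to be placed by the subsequent center-based assignment step.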
