Ensemble based Distributed K-Modes Clustering

Clustering is the unsupervised classification of data items into groups. With the rapid growth in the number of autonomous data sources, there is an emerging need for effective approaches to distributed clustering. Distributed clustering algorithms cluster distributed datasets without gathering all the data at a single site. K-Means is a popular clustering method owing to its simplicity and its speed on large datasets, but it cannot directly handle categorical attributes, which commonly occur in real-life datasets. Huang proposed the K-Modes clustering algorithm, which introduces a new dissimilarity measure for clustering categorical data. The algorithm replaces cluster means with modes and uses a frequency-based method to update the modes during clustering so as to minimize the cost function. Most distributed clustering algorithms in the literature target numerical data. In this paper, a novel Ensemble based Distributed K-Modes clustering algorithm is proposed, which is well suited to handling categorical datasets and to performing the distributed clustering process asynchronously. The performance of the proposed algorithm is compared with that of existing distributed K-Means clustering algorithms and a K-Modes based centralized clustering algorithm. The experiments are carried out on various datasets from the UCI Machine Learning Repository.
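To make the K-Modes idea concrete, the following is a minimal, centralized sketch in Python. It assumes the simple matching dissimilarity (the number of attributes on which two records differ) and a batch mode update after each assignment pass; the function names `matching_dissimilarity`, `column_modes`, and `k_modes` are hypothetical, and the sketch illustrates neither the exact update schedule of Huang's algorithm nor the ensemble based distributed algorithm proposed in this paper.

```python
from collections import Counter
import random


def matching_dissimilarity(a, b):
    """Count the attributes on which two categorical records differ."""
    return sum(x != y for x, y in zip(a, b))


def column_modes(records):
    """Return the most frequent value of each attribute (the cluster mode)."""
    return tuple(Counter(col).most_common(1)[0][0] for col in zip(*records))


def k_modes(data, k, max_iter=20, seed=0):
    """Illustrative batch K-Modes: returns (labels, modes, cost)."""
    rng = random.Random(seed)
    # Assumption: initial modes are k records drawn at random from the data.
    modes = rng.sample(data, k)
    labels = None
    for _ in range(max_iter):
        # Assignment step: each record joins its nearest mode.
        new_labels = [
            min(range(k), key=lambda j: matching_dissimilarity(rec, modes[j]))
            for rec in data
        ]
        if new_labels == labels:
            break  # assignments are stable, so the modes will not change
        labels = new_labels
        # Update step: frequency-based recomputation of each cluster's mode.
        for j in range(k):
            members = [rec for rec, lab in zip(data, labels) if lab == j]
            if members:  # an empty cluster keeps its previous mode
                modes[j] = column_modes(members)
    # Cost function: total dissimilarity of records to their cluster modes.
    cost = sum(
        matching_dissimilarity(rec, modes[lab]) for rec, lab in zip(data, labels)
    )
    return labels, modes, cost
```

A call such as `k_modes([("red", "s"), ("red", "m"), ("blue", "l"), ("blue", "xl")], k=2)` returns one label per record, the two modes, and the total cost; the result depends on the random initialization.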
