An Efficient Distributed Data Clustering Algorithm

The k-means algorithm is one of the most popular clustering algorithms in use today. However, the high running time of serial k-means limits its applicability to very large databases. Existing parallel k-means algorithms, on the other hand, require large data transfers and therefore incur high communication complexity. In many situations, transferring the actual data from local sites is also unacceptable on security and privacy grounds. This work proposes a clustering algorithm that runs on a distributed network of processors and significantly reduces computation and communication times while identifying the cluster structure inherent in a data set. The large data transfers involved in parallel k-means algorithms are avoided, which reduces communication overhead and provides better security and privacy. Experimental results show that the proposed distributed approach can provide higher speedup than other reported algorithms and can be employed effectively in large applications.
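The abstract does not spell out the algorithm's steps, so the following is only an illustrative sketch of the general idea of clustering without moving raw data. It assumes one common scheme, in which each site assigns its own points to the current centroids and sends back only per-cluster coordinate sums and counts, which a coordinator merges into new centroids. The function names (`local_summary`, `merge_summaries`, `distributed_kmeans`) are hypothetical, the sites are simulated as in-memory lists rather than networked processors, and this is not claimed to be the method proposed in the paper.

```python
from typing import List, Sequence, Tuple

Point = Sequence[float]
Summary = Tuple[List[List[float]], List[int]]  # (per-cluster sums, per-cluster counts)


def nearest(point: Point, centroids: Sequence[Point]) -> int:
    """Return the index of the centroid closest to point (squared Euclidean distance)."""
    return min(
        range(len(centroids)),
        key=lambda j: sum((p - c) ** 2 for p, c in zip(point, centroids[j])),
    )


def local_summary(points: Sequence[Point], centroids: Sequence[Point]) -> Summary:
    """Runs at one site: only sums and counts leave the site, never the points."""
    k, dim = len(centroids), len(centroids[0])
    sums = [[0.0] * dim for _ in range(k)]
    counts = [0] * k
    for pt in points:
        j = nearest(pt, centroids)
        counts[j] += 1
        for d in range(dim):
            sums[j][d] += pt[d]
    return sums, counts


def merge_summaries(summaries: Sequence[Summary],
                    centroids: Sequence[Point]) -> List[List[float]]:
    """Runs at the coordinator: combine site summaries into new centroids."""
    k, dim = len(centroids), len(centroids[0])
    new_centroids = []
    for j in range(k):
        total = sum(counts[j] for _, counts in summaries)
        if total == 0:
            # Assumption: an empty cluster simply keeps its previous centroid.
            new_centroids.append(list(centroids[j]))
        else:
            new_centroids.append(
                [sum(sums[j][d] for sums, _ in summaries) / total for d in range(dim)]
            )
    return new_centroids


def distributed_kmeans(sites: Sequence[Sequence[Point]],
                       centroids: Sequence[Point],
                       iterations: int = 10) -> List[List[float]]:
    """Simulate the exchange: each round moves k*(dim+1) numbers per site."""
    current = [list(c) for c in centroids]
    for _ in range(iterations):
        summaries = [local_summary(points, current) for points in sites]
        current = merge_summaries(summaries, current)
    return current


if __name__ == "__main__":
    sites = [
        [(0.0, 0.0), (0.0, 2.0)],      # data held at site A
        [(10.0, 10.0), (10.0, 12.0)],  # data held at site B
    ]
    print(distributed_kmeans(sites, [(0.0, 0.0), (10.0, 10.0)]))
```

Under these assumptions, the per-round traffic from each site depends on the number of clusters and dimensions rather than on the number of local points, which is the kind of communication saving and data confinement the abstract describes.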