A Parallel K-Means Clustering Algorithm with MPI

Clustering is one of the most popular methods for data analysis and is used in many disciplines, such as image segmentation, bioinformatics, pattern recognition, and statistics. K-means is the most popular and simplest clustering algorithm, owing to its ease of implementation, efficiency, and empirical success. However, real-world applications produce huge volumes of data, so handling these data efficiently in mining tasks has become a challenging and significant issue. MPI (Message Passing Interface), a message-passing programming model, offers high performance, scalability, and portability. Motivated by this, this paper proposes a parallel K-means clustering algorithm with MPI, called MKmeans. The algorithm allows K-means clustering to be applied effectively in a parallel environment. An experimental study demonstrates that MKmeans is relatively stable and portable, and that it incurs low time overhead on large data sets.
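The abstract does not describe how MKmeans distributes its work, so the following is only a minimal sketch of one common data-parallel formulation of K-means, not the paper's algorithm. It assumes the points are split across ranks, each rank computes per-cluster coordinate sums and counts for its own chunk, and the partial results are combined with a sum reduction (the role `MPI_Allreduce` with `MPI_SUM` would play in a real MPI program). The ranks are simulated in a single process, and all function names (`nearest`, `local_step`, `allreduce_sum`, `parallel_kmeans`) are hypothetical.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, ...]


def nearest(point: Point, centroids: Sequence[Point]) -> int:
    """Return the index of the centroid closest to `point` (squared Euclidean distance)."""
    best, best_dist = 0, float("inf")
    for i, c in enumerate(centroids):
        dist = sum((a - b) ** 2 for a, b in zip(point, c))
        if dist < best_dist:
            best, best_dist = i, dist
    return best


def local_step(chunk: Sequence[Point], centroids: Sequence[Point]):
    """Work assumed for one rank: per-cluster coordinate sums and counts for its chunk."""
    k, dim = len(centroids), len(centroids[0])
    sums = [[0.0] * dim for _ in range(k)]
    counts = [0] * k
    for p in chunk:
        j = nearest(p, centroids)
        counts[j] += 1
        for d in range(dim):
            sums[j][d] += p[d]
    return sums, counts


def allreduce_sum(partials):
    """Single-process stand-in for a sum all-reduce over the ranks' partial results."""
    k, dim = len(partials[0][0]), len(partials[0][0][0])
    sums = [[0.0] * dim for _ in range(k)]
    counts = [0] * k
    for part_sums, part_counts in partials:
        for j in range(k):
            counts[j] += part_counts[j]
            for d in range(dim):
                sums[j][d] += part_sums[j][d]
    return sums, counts


def parallel_kmeans(points: Sequence[Point], centroids: List[Point],
                    n_ranks: int, max_iter: int = 100) -> List[Point]:
    """Iterate until the centroids stop changing or max_iter is reached."""
    # Assumption: a simple round-robin split of the points across simulated ranks.
    chunks = [points[r::n_ranks] for r in range(n_ranks)]
    for _ in range(max_iter):
        partials = [local_step(chunk, centroids) for chunk in chunks]
        sums, counts = allreduce_sum(partials)
        # An empty cluster keeps its previous centroid.
        new_centroids = [
            tuple(s / counts[j] for s in sums[j]) if counts[j] else centroids[j]
            for j in range(len(centroids))
        ]
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids
```

In this formulation only the per-cluster sums and counts would cross the network in each iteration, so the communication volume depends on the number of clusters and dimensions rather than on the number of points.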
