A Parallel Clustering Algorithm with MPI - MKmeans

Clustering is one of the most popular methods for exploratory data analysis and is widely used in disciplines such as image segmentation, bioinformatics, pattern recognition, and statistics. K-means is the best-known clustering algorithm because of its simplicity, ease of implementation, efficiency, and empirical success. However, real-world applications produce huge volumes of data, so handling these data efficiently in data mining tasks has become a challenging and significant issue. MPI (Message Passing Interface), a message-passing programming model, offers high performance, scalability, and portability. Motivated by this, this paper proposes a parallel K-means clustering algorithm based on MPI, called MKmeans. The algorithm allows K-means clustering to be applied effectively in a parallel environment. An experimental study demonstrates that MKmeans is relatively stable and portable, and that it incurs low time overhead on large data sets.

Index Terms: clustering, K-means algorithm, MPI, parallel computing
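The abstract does not spell out how MKmeans distributes the work, so the following is only a minimal sketch of the common data-parallel K-means pattern, not the paper's algorithm. It assumes that each worker holds one chunk of the data, computes per-cluster coordinate sums and counts locally, and that these are then summed across workers (the role a reduction such as `MPI_Allreduce` would play in an MPI program). To stay self-contained, the workers are simulated inside one Python process; the names `local_stats`, `allreduce`, and `parallel_kmeans` are hypothetical.

```python
def assign(point, centroids):
    """Return the index of the nearest centroid (squared Euclidean distance)."""
    return min(
        range(len(centroids)),
        key=lambda j: sum((p - c) ** 2 for p, c in zip(point, centroids[j])),
    )


def local_stats(chunk, centroids):
    """Per-worker step: per-cluster coordinate sums and counts for one chunk."""
    k, dim = len(centroids), len(centroids[0])
    sums = [[0.0] * dim for _ in range(k)]
    counts = [0] * k
    for point in chunk:
        j = assign(point, centroids)
        counts[j] += 1
        for d in range(dim):
            sums[j][d] += point[d]
    return sums, counts


def allreduce(stats):
    """Stand-in for an MPI sum reduction: elementwise sum over all workers."""
    k, dim = len(stats[0][1]), len(stats[0][0][0])
    sums = [[sum(s[0][j][d] for s in stats) for d in range(dim)] for j in range(k)]
    counts = [sum(s[1][j] for s in stats) for j in range(k)]
    return sums, counts


def parallel_kmeans(points, centroids, n_workers=4, max_iter=100, tol=1e-9):
    """Simulated data-parallel K-means (assumed scheme, single process)."""
    # Assumption: data is split round-robin; a real MPI program would scatter it.
    chunks = [points[r::n_workers] for r in range(n_workers)]
    for _ in range(max_iter):
        sums, counts = allreduce([local_stats(c, centroids) for c in chunks])
        # Empty clusters keep their previous centroid.
        new = [
            [s / counts[j] for s in sums[j]] if counts[j] else list(centroids[j])
            for j in range(len(centroids))
        ]
        shift = max(
            sum((a - b) ** 2 for a, b in zip(n, o)) for n, o in zip(new, centroids)
        )
        centroids = new
        if shift <= tol:  # stop once no centroid has moved
            break
    return centroids
```

Only the per-cluster sums and counts cross worker boundaries in each iteration, so under this scheme the communication volume depends on the number of clusters and dimensions rather than on the number of data points.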
