Parallel K-Means Clustering Based on MapReduce

Data clustering has been received considerable attention in many applications, such as data mining, document retrieval, image segmentation and pattern classification. The enlarging volumes of information emerging by the progress of technology, makes clustering of very large scale of data a challenging task. In order to deal with the problem, many researchers try to design efficient parallel clustering algorithms. In this paper, we propose a parallel k -means clustering algorithm based on MapReduce, which is a simple yet powerful parallel programming technique. The experimental results demonstrate that the proposed algorithm can scale well and efficiently process large datasets on commodity hardware.

[1]  Xiaobo Li,et al.  Parallel clustering algorithms , 1989, Parallel Comput..

[2]  Hairong Kuang,et al.  The Hadoop Distributed File System , 2010, 2010 IEEE 26th Symposium on Mass Storage Systems and Technologies (MSST).

[3]  Clark F. Olson,et al.  Parallel Algorithms for Hierarchical Clustering , 1995, Parallel Comput..

[4]  Sanjay Ghemawat,et al.  MapReduce: Simplified Data Processing on Large Clusters , 2004, OSDI.

[5]  Edie M. Rasmussen,et al.  Efficiency of Hierarchic Agglomerative Clustering using the ICL Distributed array Processor , 1989, J. Documentation.

[6]  J. MacQueen Some methods for classification and analysis of multivariate observations , 1967 .

[7]  Ralf Lämmel,et al.  Google's MapReduce programming model - Revisited , 2007, Sci. Comput. Program..

[8]  GhemawatSanjay,et al.  The Google file system , 2003 .

[9]  Hans-Peter Kriegel,et al.  A Fast Parallel Clustering Algorithm for Large Spatial Databases , 1999, Data Mining and Knowledge Discovery.

[10]  Christoforos E. Kozyrakis,et al.  Evaluating MapReduce for Multi-core and Multiprocessor Systems , 2007, 2007 IEEE 13th International Symposium on High Performance Computer Architecture.