Optimisation Techniques for Parallel K-Means on MapReduce

The K-Means algorithm is one of the most efficient and widely used algorithms for clustering data. However, its running time increases as datasets grow larger. Moreover, the rapid increase in data volume has motivated the scientific and industrial communities to develop new technologies for storing, managing, and analysing large-scale datasets, known as Big Data. This paper describes an implementation of parallel K-Means on MapReduce, a distributed framework best known for its reliability in processing large-scale datasets. It also presents a detailed analysis of the effect of distance computations on the performance of K-Means on MapReduce. Finally, two optimisation techniques are proposed that accelerate K-Means on MapReduce by reducing the number of distance computations per iteration while producing the same deterministic results.
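To make the role of distance computations concrete, the following is a minimal single-process sketch of one K-Means iteration written in a map/reduce style. It is not the paper's implementation and does not use Hadoop. The function names (kmeans_map, kmeans_reduce, kmeans_iteration) are hypothetical, and the sketch assumes Euclidean distance and points stored as tuples. It shows only the unoptimised baseline, in which every point is compared against every centroid, so one iteration costs n * k distance computations.

```python
from collections import defaultdict
from math import dist  # Euclidean distance, available from Python 3.8


def kmeans_map(point, centroids):
    """Map step (hypothetical name): emit (nearest centroid index, (point, 1))."""
    # The baseline computes one distance per centroid for every point.
    nearest = min(range(len(centroids)), key=lambda j: dist(point, centroids[j]))
    return nearest, (point, 1)


def kmeans_reduce(values):
    """Reduce step (hypothetical name): average the points assigned to one centroid."""
    total = None
    count = 0
    for point, n in values:
        if total is None:
            total = list(point)
        else:
            total = [a + b for a, b in zip(total, point)]
        count += n
    return tuple(s / count for s in total)


def kmeans_iteration(points, centroids):
    """Run one iteration in-process, grouping map output by key as a shuffle would."""
    groups = defaultdict(list)
    for p in points:
        key, value = kmeans_map(p, centroids)
        groups[key].append(value)

    # A centroid that received no points keeps its previous position (an assumption).
    new_centroids = list(centroids)
    for key, values in groups.items():
        new_centroids[key] = kmeans_reduce(values)

    # Baseline cost: every point is compared against every centroid.
    distance_computations = len(points) * len(centroids)
    return new_centroids, distance_computations
```

In this baseline the map step accounts for all n * k distance computations in an iteration, which is the cost that the two optimisation techniques aim to reduce without changing the resulting clusters.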
