An improved MapReduce design of k-means for clustering very large datasets

Clustering is an important data analysis technique used to group data into clusters of similar items, and it is widely applicable across many fields. However, clustering becomes very challenging due to the sharp increase in the volume of data generated by modern applications. K-means is a simple and widely used algorithm for cluster analysis, but the traditional k-means is computationally expensive, sensitive to outliers, and produces unstable results, which makes it inefficient on very large datasets. Solving these issues has been the subject of much recent research. MapReduce is a simplified programming model for processing data-intensive applications in a parallel environment. In this paper, we propose an improved MapReduce-based design of k-means that adapts it to large-scale datasets by reducing its execution time. Moreover, we propose two additional algorithms: the first removes outliers from the dataset, and the second automatically selects the initial centroids, thereby stabilizing the result. The implementation of our proposed algorithm on the Hadoop platform shows that it is much faster than three other existing algorithms from the literature.
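
The sketch below illustrates, in plain Python, how a single k-means iteration decomposes into a map phase (assign each point to its nearest centroid) and a reduce phase (average the points of each cluster to obtain new centroids). This is only a minimal illustration of the general MapReduce decomposition that the proposed design builds on; it is not the authors' Hadoop implementation, and the sample points and initial centroids are made up for the example.

# Minimal sketch of one k-means iteration as map/reduce steps (illustrative only).
import math
from collections import defaultdict

def nearest_centroid(point, centroids):
    # Index of the centroid closest to the point (Euclidean distance).
    return min(range(len(centroids)),
               key=lambda i: math.dist(point, centroids[i]))

def map_phase(points, centroids):
    # Map: emit (centroid_id, point) for each point's nearest centroid.
    for p in points:
        yield nearest_centroid(p, centroids), p

def reduce_phase(mapped):
    # Reduce: average the points assigned to each centroid to get new centroids.
    sums, counts = defaultdict(lambda: None), defaultdict(int)
    for cid, p in mapped:
        sums[cid] = p if sums[cid] is None else [a + b for a, b in zip(sums[cid], p)]
        counts[cid] += 1
    return {cid: [x / counts[cid] for x in s] for cid, s in sums.items()}

if __name__ == "__main__":
    points = [(1.0, 1.0), (1.5, 2.0), (8.0, 8.0), (9.0, 9.5)]   # toy data
    centroids = [(1.0, 1.0), (9.0, 9.0)]                        # assumed initial centroids
    print(reduce_phase(map_phase(points, centroids)))
    # e.g. {0: [1.25, 1.5], 1: [8.5, 8.75]}

In a real Hadoop deployment the map and reduce functions run in parallel over data splits, and the loop repeats until the centroids converge; the paper's contribution concerns how this iteration, outlier removal, and initial centroid selection are organized to reduce execution time.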