An improved MapReduce design of k-means for clustering very large datasets

Clustering is an important data analysis technique used to group data into clusters of similar items, and it is widely applicable across many fields. However, clustering becomes very challenging due to the sharp increase in the volume of data generated by modern applications. K-means is a simple and widely used algorithm for cluster analysis, but the traditional k-means is computationally expensive, sensitive to outliers, and produces unstable results, which makes it inefficient on very large datasets. Solving these issues has been the subject of much recent research. MapReduce is a simplified programming model for processing data-intensive applications in a parallel environment. In this paper, we propose an improved MapReduce-based design of k-means that adapts it to large-scale datasets by reducing its execution time. Moreover, we propose two additional algorithms: the first removes outliers from the dataset, and the second automatically selects the initial centroids, thereby stabilizing the result. The implementation of our proposed algorithm on the Hadoop platform shows that it is much faster than three other existing algorithms from the literature.
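
The sketch below illustrates, in plain Python, how a single k-means iteration decomposes into a map phase (assign each point to its nearest centroid) and a reduce phase (average the points of each cluster to obtain new centroids). This is only a minimal illustration of the general MapReduce decomposition that the proposed design builds on; it is not the authors' Hadoop implementation, and the sample points and initial centroids are made up for the example.

# Minimal sketch of one k-means iteration as map/reduce steps (illustrative only).
import math
from collections import defaultdict

def nearest_centroid(point, centroids):
    # Index of the centroid closest to the point (Euclidean distance).
    return min(range(len(centroids)),
               key=lambda i: math.dist(point, centroids[i]))

def map_phase(points, centroids):
    # Map: emit (centroid_id, point) for each point's nearest centroid.
    for p in points:
        yield nearest_centroid(p, centroids), p

def reduce_phase(mapped):
    # Reduce: average the points assigned to each centroid to get new centroids.
    sums, counts = defaultdict(lambda: None), defaultdict(int)
    for cid, p in mapped:
        sums[cid] = p if sums[cid] is None else [a + b for a, b in zip(sums[cid], p)]
        counts[cid] += 1
    return {cid: [x / counts[cid] for x in s] for cid, s in sums.items()}

if __name__ == "__main__":
    points = [(1.0, 1.0), (1.5, 2.0), (8.0, 8.0), (9.0, 9.5)]   # toy data
    centroids = [(1.0, 1.0), (9.0, 9.0)]                        # assumed initial centroids
    print(reduce_phase(map_phase(points, centroids)))
    # e.g. {0: [1.25, 1.5], 1: [8.5, 8.75]}

In a real Hadoop deployment the map and reduce functions run in parallel over data splits, and the loop repeats until the centroids converge; the paper's contribution concerns how this iteration, outlier removal, and initial centroid selection are organized to reduce execution time.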