An optimal distributed K-Means clustering algorithm based on CloudStack

Clustering algorithms are applied in many fields, especially data mining. As data volumes grow, the computation time of clustering algorithms becomes prohibitive under the traditional computing model. To handle big data, data mining algorithms have therefore moved from their original single-core, single-machine implementations to parallel and distributed processing, and parallel processing has become the most popular way to improve execution performance. This paper establishes a Hadoop distributed cluster on CloudStack and implements an optimal distributed K-Means clustering algorithm based on MapReduce. The proposed algorithm achieves both good result quality and efficient execution time. The experimental results show that the optimal distributed K-Means clustering algorithm delivers better performance on large-scale data sets.
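To illustrate how K-Means can be expressed in the MapReduce model, the following is a minimal sketch of the standard (Lloyd) iteration split into map, combine, and reduce steps. It runs in a single Python process and only imitates the data flow; it does not use Hadoop or CloudStack, and it is not the optimized variant proposed in this paper, whose details are not given in the abstract. All function names (`map_phase`, `combine`, `reduce_phase`, `kmeans`) and the parameters `max_iter` and `tol` are assumptions made for illustration.

```python
from typing import Dict, Iterable, Iterator, List, Sequence, Tuple

Point = Tuple[float, ...]
Partial = Dict[int, Tuple[Point, int]]  # centroid index -> (coordinate sum, count)


def squared_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def map_phase(points: Iterable[Point],
              centroids: List[Point]) -> Iterator[Tuple[int, Tuple[Point, int]]]:
    """Map step: emit (index of nearest centroid, (point, 1)) for each point."""
    for p in points:
        nearest = min(range(len(centroids)),
                      key=lambda i: squared_distance(p, centroids[i]))
        yield nearest, (p, 1)


def combine(pairs: Iterable[Tuple[int, Tuple[Point, int]]]) -> Partial:
    """Combine step: sum coordinates and counts per centroid index."""
    sums: Partial = {}
    for idx, (vec, count) in pairs:
        if idx in sums:
            acc, n = sums[idx]
            sums[idx] = (tuple(a + v for a, v in zip(acc, vec)), n + count)
        else:
            sums[idx] = (tuple(vec), count)
    return sums


def reduce_phase(partials: Iterable[Partial], centroids: List[Point]) -> List[Point]:
    """Reduce step: merge partial sums and compute the new centroid means."""
    merged = combine((idx, val) for part in partials for idx, val in part.items())
    updated = list(centroids)  # a centroid with no assigned points is kept as-is
    for idx, (acc, n) in merged.items():
        updated[idx] = tuple(a / n for a in acc)
    return updated


def kmeans(splits: List[List[Point]], centroids: List[Point],
           max_iter: int = 20, tol: float = 1e-9) -> List[Point]:
    """Repeat map/combine/reduce rounds until the centroids stop moving."""
    for _ in range(max_iter):
        # Each split stands in for the input of one mapper.
        partials = [combine(map_phase(split, centroids)) for split in splits]
        updated = reduce_phase(partials, centroids)
        shift = max(squared_distance(a, b) for a, b in zip(centroids, updated))
        centroids = updated
        if shift <= tol:
            break
    return centroids
```

In this sketch each inner list passed as `splits` plays the role of one mapper's input split, and the combine step reduces the amount of data that would be shuffled to the reducer, which is the usual reason for aggregating partial sums in a distributed K-Means.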