Research on Improved k-Means Clustering Algorithm Based on Hadoop Platform

This paper addresses two problems that the traditional k-means clustering algorithm faces in big data processing: poor performance and the difficulty of determining the initial cluster centers. It proposes an improved k-means clustering algorithm based on the Hadoop platform. The algorithm uses the canopy algorithm with cosine similarity to optimize the selection of the initial cluster centers for k-means, and it is parallelized with a parallel computing framework to adapt it to big data processing. Experimental results show that the improved algorithm achieves a better clustering effect and also exhibits good speedup and scalability when processing large amounts of data.
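The canopy-based seeding step can be illustrated with a minimal single-machine sketch. It is not the paper's implementation: the function names `cosine_similarity` and `canopy_centers`, the single similarity threshold `t_tight`, and the rule of taking the first remaining point as the next center are all assumptions made for illustration, and the Hadoop parallelization is omitted.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length numeric vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # assumption: a zero vector is treated as similar to nothing
    return dot / (norm_a * norm_b)


def canopy_centers(points, t_tight):
    """Pick canopy centers to seed k-means (hypothetical helper).

    Repeatedly takes the first remaining point as a center and discards
    every remaining point whose cosine similarity to that center is at
    least t_tight, so that near-duplicates cannot become centers too.
    """
    remaining = list(points)
    centers = []
    while remaining:
        center = remaining.pop(0)
        centers.append(center)
        remaining = [
            p for p in remaining if cosine_similarity(center, p) < t_tight
        ]
    return centers


if __name__ == "__main__":
    data = [(1.0, 0.0), (0.9, 0.1), (0.0, 1.0), (0.1, 0.9)]
    seeds = canopy_centers(data, t_tight=0.9)
    print(len(seeds), seeds)
```

In this sketch the number of centers returned would serve as k, and the centers themselves as the initial cluster centers passed to k-means, which avoids both guessing k and choosing the starting centers at random.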
