Paper information - Parallel K-Medoids clustering algorithm based on Hadoop

Parallel K-Medoids clustering algorithm based on Hadoop

The K-Medoids clustering algorithm overcomes the sensitivity of the K-Means algorithm to outlier samples, but its time complexity prevents it from processing big data [1]. MapReduce is a parallel programming model for processing big data and is implemented in Hadoop. To overcome this limitation, HK-Medoids, a parallel K-Medoids algorithm based on Hadoop, is proposed. Each submitted job runs many iterative MapReduce rounds: in the map phase, each sample is assigned to the cluster whose center is most similar to it; in the combine phase, an intermediate center is calculated for each cluster; and in the reduce phase, the new center is calculated. The iteration stops when the new center is similar to the old one. The experimental results show that the HK-Medoids algorithm produces good clustering results and achieves linear speedup on big data.
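The map, combine, and reduce phases described above can be sketched in a single Python process. This is a minimal illustration and not the authors' Hadoop implementation. The abstract does not say how the intermediate or new centers are computed, so several parts are assumptions: Euclidean distance as the similarity measure, a local medoid per input split as the intermediate center, a count-weighted choice among those candidates in the reduce step, and a distance tolerance `tol` as the test for "similar" centers. All function names are hypothetical.

```python
import math
from collections import defaultdict

def dist(a, b):
    # Assumption: Euclidean distance stands in for the similarity measure.
    return math.dist(a, b)  # requires Python 3.8+

def map_phase(split, centers):
    """Emit (cluster index, sample) for the nearest current center."""
    for x in split:
        k = min(range(len(centers)), key=lambda i: dist(x, centers[i]))
        yield k, x

def combine_phase(pairs):
    """Per split, emit one intermediate center per cluster.

    Assumption: the intermediate center is the local medoid, i.e. the
    member with the smallest total distance to the other members.
    """
    groups = defaultdict(list)
    for k, x in pairs:
        groups[k].append(x)
    for k, members in groups.items():
        medoid = min(members, key=lambda c: sum(dist(c, m) for m in members))
        yield k, (medoid, len(members))

def reduce_phase(candidates):
    """Assumption: pick the candidate with the smallest count-weighted
    distance to all candidates received for this cluster."""
    best = min(candidates,
               key=lambda c: sum(n * dist(c[0], m) for m, n in candidates))
    return best[0]

def hk_medoids_sketch(splits, centers, tol=1e-9, max_iter=50):
    """Repeat map/combine/reduce rounds until the centers stop moving."""
    for _ in range(max_iter):
        shuffled = defaultdict(list)  # stands in for the shuffle step
        for split in splits:
            for k, cand in combine_phase(map_phase(split, centers)):
                shuffled[k].append(cand)
        new_centers = [reduce_phase(shuffled[k]) if k in shuffled else centers[k]
                       for k in range(len(centers))]
        if all(dist(a, b) <= tol for a, b in zip(centers, new_centers)):
            return new_centers
        centers = new_centers
    return centers

# Two input splits, as if the data were stored in two file blocks.
splits = [
    [(0, 0), (1, 0), (0, 1), (10, 10), (11, 10), (10, 11)],
    [(1, 1), (11, 11)],
]
centers = hk_medoids_sketch(splits, centers=[(1, 1), (11, 11)])
```

Because every center is chosen from the samples themselves, the result is always a set of real data points, which is what distinguishes K-Medoids from K-Means. Choosing medoids per split keeps the combine step local to each mapper, at the cost of only approximating the medoid of the full cluster.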

Jiongmin Zhang | Yaobin Jiang

[1] Jiawei Han, et al. CLARANS: A Method for Clustering Objects for Spatial Data Mining, 2002, IEEE Trans. Knowl. Data Eng.

[2] Chunming Rong, et al. K-means Clustering in the Cloud -- A Mahout Test, 2011, 2011 IEEE Workshops of International Conference on Advanced Information Networking and Applications.

[3] Sanjay Ghemawat, et al. MapReduce: Simplified Data Processing on Large Clusters, 2004, OSDI.

[4] Hae-Sang Park, et al. A simple and fast algorithm for K-medoids clustering, 2009, Expert Syst. Appl.

[5] Daniel T. Larose, et al. Discovering Knowledge in Data: An Introduction to Data Mining, 2005.