An improved affinity propagation clustering algorithm for large-scale data sets

Affinity propagation (AP) clustering does not require the number of clusters to be specified in advance and offers advantages in efficiency and accuracy, but it is not suitable for clustering large-scale data. To achieve both low time complexity and good accuracy when applying affinity propagation to large-scale data, an improved AP clustering algorithm named hierarchical affinity propagation (HAP) is proposed, which clusters the data points by applying the AP algorithm several times to data at different levels. The data set to be clustered is first divided into several subsets, each small enough to be clustered efficiently by the AP algorithm. The AP algorithm is then run on each subset to select that subset's local cluster centers. Next, AP clustering is run again on all the local cluster centers to select well-suited global exemplars for the whole data set. Finally, every data point is assigned to a cluster according to its similarity to the global exemplars, so that large-scale data can be clustered efficiently and accurately. Experimental results on real and simulated data sets show that, compared with the traditional AP algorithm and the adaptive AP algorithm, the HAP algorithm greatly reduces clustering time while giving relatively better clustering results.
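The four steps described above can be illustrated with a minimal sketch. It is not the authors' implementation: the random partition into subsets, the negative squared Euclidean similarity, the median preference, the damping factor, the fixed iteration count, and the names `affinity_propagation`, `hap`, and `n_subsets` are all assumptions made for illustration. Plain AP works on a dense n × n similarity matrix, which is why running it only on small subsets and on the local centers keeps the cost down.

```python
import random
from statistics import median

def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def affinity_propagation(points, damping=0.5, iters=200):
    """Plain AP message passing; returns the indices of the exemplars.
    Assumed choices: similarity = -squared distance, preference = median."""
    n = len(points)
    if n == 1:
        return [0]
    S = [[-sq_dist(points[i], points[k]) for k in range(n)] for i in range(n)]
    pref = median(S[i][k] for i in range(n) for k in range(n) if i != k)
    for k in range(n):
        S[k][k] = pref
    R = [[0.0] * n for _ in range(n)]  # responsibilities r(i, k)
    A = [[0.0] * n for _ in range(n)]  # availabilities a(i, k)
    for _ in range(iters):
        for i in range(n):
            # r(i,k) = s(i,k) - max over k' != k of (a(i,k') + s(i,k'))
            row = [A[i][k] + S[i][k] for k in range(n)]
            first = max(range(n), key=row.__getitem__)
            best = row[first]
            second = max(row[k] for k in range(n) if k != first)
            for k in range(n):
                new = S[i][k] - (second if k == first else best)
                R[i][k] = damping * R[i][k] + (1 - damping) * new
        for k in range(n):
            # a(i,k) = min(0, r(k,k) + sum of positive r(i',k), i' not in {i,k})
            pos = [max(0.0, R[i][k]) for i in range(n)]
            total = sum(pos) - pos[k]
            for i in range(n):
                new = total if i == k else min(0.0, R[k][k] + total - pos[i])
                A[i][k] = damping * A[i][k] + (1 - damping) * new
    score = [R[k][k] + A[k][k] for k in range(n)]
    exemplars = [k for k in range(n) if score[k] > 0]
    # Fallback (an assumption): keep the best-scoring point if none qualifies.
    return exemplars or [max(range(n), key=score.__getitem__)]

def hap(points, n_subsets=4, seed=0):
    """Sketch of the hierarchical scheme; returns (exemplar indices, labels)."""
    # Step 1: divide the data into subsets (random partition is an assumption).
    idx = list(range(len(points)))
    random.Random(seed).shuffle(idx)
    subsets = [idx[j::n_subsets] for j in range(n_subsets)]
    # Step 2: run AP on each subset to select local cluster centers.
    local = []
    for sub in subsets:
        if sub:
            found = affinity_propagation([points[i] for i in sub])
            local.extend(sub[e] for e in found)
    # Step 3: run AP again on the local centers to select global exemplars.
    chosen = affinity_propagation([points[i] for i in local])
    exemplars = [local[g] for g in chosen]
    # Step 4: assign every point to its most similar (nearest) global exemplar.
    labels = [min(range(len(exemplars)),
                  key=lambda c: sq_dist(p, points[exemplars[c]]))
              for p in points]
    return exemplars, labels
```

Calling `hap(points, n_subsets=3)` on a list of coordinate tuples returns the indices of the global exemplars together with one cluster label per point. How many subsets to use, and how to form them, is left open here; the sketch only mirrors the order of the steps.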
