CUDA-Based Parallelization of Power Iteration Clustering for Large Datasets

This paper presents GPIC, a graphics processing unit (GPU) accelerated algorithm for power iteration clustering (PIC). The algorithm is based on the original PIC proposal and is adapted to take advantage of the GPU architecture while maintaining the original algorithm's properties. We compared the proposed method against the serial implementation and achieved a considerable speedup in tests with synthetic and real data sets. In an application with a large volume of real data ($> 10^{7}$ records), the GPIC implementation scaled well to data sets with millions of data points. Our implementation efforts are directed towards two aspects: processing large data sets in less time and maintaining the same quality of clustering results as the original PIC version.
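
For readers unfamiliar with PIC, the serial procedure that GPIC accelerates can be sketched roughly as follows. This is a minimal CPU-only illustration under stated assumptions, not the authors' implementation: the function name `power_iteration_embedding`, the Gaussian affinity with bandwidth `sigma`, the degree-vector starting point, and the tolerance value are all choices made here for illustration. The sketch builds an affinity matrix, row-normalizes it, and repeatedly multiplies a vector by it, stopping early once the change between iterations levels off.

```python
import math

def power_iteration_embedding(points, sigma=1.0, max_iter=1000, tol=1e-5):
    """Return a one-dimensional PIC-style embedding of `points` (coordinate tuples).

    Hypothetical helper: name, affinity function, and defaults are assumptions.
    """
    n = len(points)
    # Gaussian affinity matrix with a zero diagonal (assumed similarity choice).
    A = [[0.0 if i == j else
          math.exp(-math.dist(points[i], points[j]) ** 2 / (2.0 * sigma ** 2))
          for j in range(n)] for i in range(n)]
    degree = [sum(row) for row in A]
    # Row-normalize the affinities: W = D^-1 A.
    W = [[a / degree[i] for a in A[i]] for i in range(n)]
    # Start from the degree vector, scaled to unit L1 norm.
    total = sum(degree)
    v = [d / total for d in degree]
    prev_delta = float("inf")
    for _ in range(max_iter):
        # One power-iteration step: v <- W v, rescaled to unit L1 norm.
        wv = [sum(W[i][j] * v[j] for j in range(n)) for i in range(n)]
        norm = sum(abs(x) for x in wv)
        new_v = [x / norm for x in wv]
        delta = max(abs(a - b) for a, b in zip(new_v, v))
        v = new_v
        # Stop early once the per-iteration change has itself stopped changing.
        if abs(prev_delta - delta) < tol:
            break
        prev_delta = delta
    return v

# Two well-separated groups of different size (made-up toy data).
pts = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (10.0, 10.0), (10.1, 10.0)]
emb = power_iteration_embedding(pts)
```

In this toy run, points in the same group end up with nearly identical embedding values, while the two groups sit at clearly different values. PIC then obtains the final cluster labels by running a standard clusterer such as k-means on this one-dimensional embedding. The affinity construction and the repeated matrix-vector product dominate the cost of the sketch, which is why they are plausible targets for GPU parallelization.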
