Design and Evaluation of a Parallel Execution Framework for the CLEVER Clustering Algorithm

Data mining extracts valuable knowledge from vast pools of data. Because of the computational complexity of the algorithms involved and the difficulty of handling large data sets, data mining applications often require days to complete their analysis. This paper presents the design and evaluation of a parallel computation framework for CLEVER, a prototype-based clustering algorithm that has been used successfully in a wide range of application scenarios. The algorithm supports plug-in fitness functions and employs randomized hill climbing to maximize a given fitness function. We explore various parallelization strategies using OpenMP and CUDA, and evaluate the performance of the parallel algorithms on three different data sets. Our results indicate nearly linear scalability of the parallel algorithm on multi-core processors, which reduces execution time and makes it possible to solve problems that were considered infeasible with the sequential version of CLEVER.
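The combination of a plug-in fitness function, randomized hill climbing, and parallel evaluation of candidate solutions can be illustrated with a minimal Python sketch. It is not the CLEVER implementation. The names `hill_climb`, `neighbor`, and `neg_sse`, the neighborhood operator (swapping one prototype for a random data point), the stopping rule, and all parameter values are assumptions made for illustration. A thread pool stands in for the OpenMP parallel loop, and in CPython it shows only the structure of the parallelism, not a speedup.

```python
import random
from concurrent.futures import ThreadPoolExecutor


def assign(points, prototypes):
    """Return, for each point, the index of its nearest prototype."""
    def nearest(p):
        return min(range(len(prototypes)),
                   key=lambda i: sum((a - b) ** 2
                                     for a, b in zip(p, prototypes[i])))
    return [nearest(p) for p in points]


def neg_sse(points, prototypes):
    """Example plug-in fitness (assumed): negative sum of squared distances."""
    labels = assign(points, prototypes)
    return -sum(sum((a - b) ** 2 for a, b in zip(p, prototypes[l]))
                for p, l in zip(points, labels))


def neighbor(points, prototypes, rng):
    """Hypothetical neighborhood operator: swap one prototype for a data point."""
    candidate = list(prototypes)
    candidate[rng.randrange(len(candidate))] = rng.choice(points)
    return candidate


def hill_climb(points, k, fitness, samples=8, max_iters=50, seed=0, workers=4):
    """Randomized hill climbing that maximizes any fitness(points, prototypes)."""
    rng = random.Random(seed)
    current = rng.sample(points, k)   # random initial prototypes
    best = fitness(points, current)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for _ in range(max_iters):
            # Sample neighbors sequentially so the run stays reproducible.
            candidates = [neighbor(points, current, rng) for _ in range(samples)]
            # The fitness evaluations are independent, so they can run in parallel.
            scores = list(pool.map(lambda c: fitness(points, c), candidates))
            top = max(range(samples), key=scores.__getitem__)
            if scores[top] <= best:   # no sampled neighbor improves: stop
                break
            current, best = candidates[top], scores[top]
    return current, best
```

Because the fitness function is passed in as an argument, a different objective can be substituted without touching the search loop. In this sketch the parallel work is the evaluation of sampled neighbors within one iteration.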
