Efficient Parallel Hierarchical Clustering

Hierarchical agglomerative clustering (HAC) is a common clustering method that outputs a dendrogram showing all N levels of agglomerations where N is the number of objects in the data set. High time and memory complexities are some of the major bottlenecks in its application to real-world problems. In the literature parallel algorithms are proposed to overcome these limitations. But, as this paper shows, existing parallel HAC algorithms are inefficient due to ineffective partitioning of the data. We first show how HAC follows a rule where most agglomerations have very small dissimilarity and only a small portion towards the end have large dissimilarity. Partially overlapping partitioning (POP) exploits this principle and obtains efficient yet accurate HAC algorithms. The total number of dissimilarities is reduced by a factor close to the number of cells in the partition. We present pPOP, the parallel version of POP, that is implemented on a shared memory multiprocessor architecture. Extensive theoretical analysis and experimental results are presented and show that pPOP gives close to linear speedup and outperforms the existing parallel algorithms significantly both in CPU time and memory requirements.

[1]  Xiaobo Li,et al.  Parallel clustering algorithms , 1989, Parallel Comput..

[2]  Xiaobo Li,et al.  Parallel Algorithms for Hierarchical Clustering and Cluster Validity , 1990, IEEE Trans. Pattern Anal. Mach. Intell..

[3]  Clark F. Olson,et al.  Parallel Algorithms for Hierarchical Clustering , 1995, Parallel Comput..

[4]  Tian Zhang,et al.  BIRCH: an efficient data clustering method for very large databases , 1996, SIGMOD '96.

[5]  Inderjit S. Dhillon,et al.  A Data-Clustering Algorithm on Distributed Memory Multiprocessors , 1999, Large-Scale Parallel Data Mining.

[6]  Alok N. Choudhary,et al.  A scalable parallel subspace clustering algorithm for massive data sets , 2000, Proceedings 2000 International Conference on Parallel Processing.

[7]  Rohit Chandra,et al.  Parallel programming in openMP , 2000 .

[8]  Shi-Jinn Horng,et al.  Efficient Parallel Algorithms for Hierarchical Clustering on Arrays with Reconfigurable Optical Buses , 2000, J. Parallel Distributed Comput..

[9]  Mohammed J. Zaki,et al.  Large-Scale Parallel Data Mining , 2002, Lecture Notes in Computer Science.

[10]  Kian-Lee Tan,et al.  Fast hierarchical clustering and its validation , 2003, Data Knowl. Eng..