pPOP: Fast yet accurate parallel hierarchical clustering using partitioning

Hierarchical agglomerative clustering (HAC) is widely useful, but its high CPU time and memory complexity limit its practical use. Earlier, we proposed an efficient partitioning scheme, partially overlapping partitioning (POP), based on the observation that HAC agglomerates small, closely placed clusters first and larger, distant clusters only towards the end. Here, we present pPOP, the parallel version of POP. Theoretical analysis shows that, compared to existing algorithms, pPOP achieves a CPU time speed-up and memory scale-down of O(c) without compromising accuracy, where c is the number of cells in the partition. A shared-memory implementation shows that pPOP significantly outperforms existing algorithms.
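The partitioning idea can be illustrated with a minimal sketch. It is not the authors' implementation: the 1-D data, the equal-width cells, the overlap width `delta`, and the helper names `overlapping_cells` and `closest_pair` are all assumptions made for illustration. It shows only that cells can be searched independently for their closest pair, and that the number of pairwise distance computations drops compared with a global search.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import combinations


def overlapping_cells(points, c, delta):
    """Split 1-D points into c equal-width cells, each widened by delta on both sides.

    Hypothetical layout: the abstract does not specify how the overlap is placed.
    """
    lo, hi = min(points), max(points)
    width = (hi - lo) / c
    cells = []
    for i in range(c):
        left = lo + i * width - delta
        right = lo + (i + 1) * width + delta
        cells.append([p for p in points if left <= p <= right])
    return cells


def closest_pair(cell):
    """Brute-force closest pair in one cell: (distance, a, b), or None if fewer than 2 points."""
    if len(cell) < 2:
        return None
    return min((abs(a - b), a, b) for a, b in combinations(cell, 2))


def pair_count(n):
    """Number of unordered pairs among n items."""
    return n * (n - 1) // 2


points = [0.0, 0.1, 0.9, 1.05, 2.0, 2.2, 3.0, 3.7]
cells = overlapping_cells(points, c=4, delta=0.1)

# Each cell is independent, so the per-cell searches can run concurrently.
# Threads are used here only to show that independence, not to measure speed-up.
with ThreadPoolExecutor() as pool:
    per_cell = list(pool.map(closest_pair, cells))

pairs_global = pair_count(len(points))
pairs_cells = sum(pair_count(len(cell)) for cell in cells)
print(pairs_global, pairs_cells)
```

In this toy input the per-cell searches examine far fewer pairs than a global search, which is the intuition behind a speed-up that grows with the number of cells c. The overlap matters because a close pair that straddles a cell boundary would otherwise be missed by every cell.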
