Parallel Algorithms for Hierarchical Clustering

Hierarchical clustering is a common method for determining clusters of similar data points in multi-dimensional spaces. Sequential $O(n^2)$ algorithms, where $n$ is the number of points to cluster, have long been known for this problem. This paper discusses parallel algorithms that perform hierarchical clustering using various distance metrics. I describe $O(n)$ time algorithms for clustering using the single link, average link, complete link, centroid, median, and minimum variance metrics on an $n$-node CRCW PRAM, and $O(n \log n)$ algorithms for these metrics (except average link and complete link) on $\frac{n}{\log n}$-node butterfly networks or trees. Thus, optimal efficiency is achieved for a significant number of processors using these distance metrics. A general algorithm is also given that can be used to perform clustering with the complete link and average link metrics on a butterfly. While this algorithm achieves optimal efficiency for the general class of metrics, it is not optimal for the specific cases of complete link and average link clustering.
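For orientation, the sequential agglomerative procedure that these parallel algorithms speed up can be sketched as follows. This is a minimal illustration, not one of the paper's algorithms: the function name `agglomerate`, its `metric` parameter, and the use of Euclidean distance are assumptions made for the example, and the naive scan over all pairs at each step makes it $O(n^3)$ rather than $O(n^2)$.

```python
import math
from itertools import combinations

def agglomerate(points, metric="single"):
    """Naive agglomerative clustering (illustrative sketch only).

    Returns the merges in order as (i, j, distance) tuples; cluster j is
    merged into cluster i, and i keeps representing the merged cluster.
    """
    n = len(points)
    # Pairwise Euclidean distances between the initial singleton clusters.
    d = [[math.dist(p, q) for q in points] for p in points]
    size = [1] * n
    active = list(range(n))  # kept sorted, so results are deterministic
    merges = []
    while len(active) > 1:
        # Closest pair of active clusters, found by scanning all pairs.
        i, j = min(combinations(active, 2), key=lambda ab: d[ab[0]][ab[1]])
        merges.append((i, j, d[i][j]))
        # Update the distance from the merged cluster to every other cluster.
        for k in active:
            if k in (i, j):
                continue
            if metric == "single":
                new = min(d[i][k], d[j][k])
            elif metric == "complete":
                new = max(d[i][k], d[j][k])
            elif metric == "average":
                # Size-weighted mean of the two old distances.
                new = (size[i] * d[i][k] + size[j] * d[j][k]) / (size[i] + size[j])
            else:
                raise ValueError(f"unsupported metric: {metric}")
            d[i][k] = d[k][i] = new
        size[i] += size[j]
        active.remove(j)
    return merges
```

Each pass of the outer loop performs one merge, so there are $n - 1$ merges in total. Any parallel version therefore has to make the closest-pair search and the distance update inside a single pass fast, which is where the processors are put to work.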
