Design and evaluation of a parallel HOP clustering algorithm for cosmological simulation

Clustering, or unsupervised classification, has many uses in fields that depend on grouping results from large amounts of data; one example is N-body cosmological simulation in astrophysics. In this paper, we study a particular clustering algorithm used in astrophysics, called HOP, and present a parallel implementation that speeds up the existing sequential implementation. Our approach first builds, in parallel, a hierarchical spatial data structure: a three-dimensional KD tree. With the KD tree, the core step of the HOP algorithm, the search for each particle's highest-density neighbor, can be performed using only subsets of the particles, which reduces the communication cost. We evaluate our implementation using data sets from a production cosmological application. The experimental results demonstrate speedups of up to 24× using 64 processors on three parallel processing machines.
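The highest-density-neighbor search at the core of HOP can be illustrated with a minimal sequential sketch. It is not the parallel implementation described above: the function names `k_nearest` and `hop_groups` are hypothetical, the neighbor search is brute force rather than KD-tree based, and the density estimate (k divided by the volume of the sphere reaching the k-th neighbor) is a simplifying assumption. Points are assumed to be distinct 3D coordinates.

```python
import math


def k_nearest(points, i, k):
    """Indices of the k nearest neighbors of points[i], excluding i.

    Brute-force search, used here only for brevity (assumption); a KD tree
    would replace this step in a real implementation.
    """
    others = (j for j in range(len(points)) if j != i)
    return sorted(others, key=lambda j: math.dist(points[i], points[j]))[:k]


def hop_groups(points, k=3):
    """Assign each particle to the local density maximum it hops to."""
    n = len(points)
    neighbors = [k_nearest(points, i, k) for i in range(n)]

    # Crude density estimate (assumption): k particles inside the sphere
    # whose radius is the distance to the k-th nearest neighbor.
    density = []
    for i in range(n):
        r = math.dist(points[i], points[neighbors[i][-1]])
        density.append(k / (4.0 / 3.0 * math.pi * r ** 3))

    # Each particle points to the densest particle among itself and its
    # neighbors; ties favor the particle itself, so chains cannot cycle.
    hop = [max([i] + neighbors[i], key=lambda j: density[j]) for i in range(n)]

    # Follow the chain of hops until reaching a particle that is its own
    # densest neighbor; that particle's index labels the group.
    labels = []
    for i in range(n):
        j = i
        while hop[j] != j:
            j = hop[j]
        labels.append(j)
    return labels
```

Because each particle only needs its own neighborhood to choose a hop target, the search can work on spatial subsets of the particles, which is the property the KD-tree decomposition exploits.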
