Load Balancing and Data Locality in Adaptive Hierarchical N-Body Methods: Barnes-Hut, Fast Multipole, and Radiosity

Abstract

Hierarchical N-body methods, which are based on a fundamental insight into the nature of many physical processes, are increasingly being used to solve large-scale problems in a variety of scientific/engineering domains. Applications that use these methods are challenging to parallelize effectively, however, owing to their nonuniform, dynamically changing characteristics and their need for long-range communication. In this paper, we study the partitioning and scheduling techniques required to obtain effective parallel performance on applications that use a range of hierarchical N-body methods. To obtain representative coverage, we first examine applications that use the two best methods known for classical N-body problems: the Barnes-Hut method and the fast multipole method. Then, we examine a recent hierarchical method for radiosity calculations in computer graphics, which applies the hierarchical N-body approach to a problem with very different characteristics. We find that straightforward decomposition techniques which an automatic scheduler might implement do not scale well, because they are unable to simultaneously provide load balancing and data locality. However, all the applications yield very good parallel performance if appropriate partitioning and scheduling techniques are implemented by the programmer. For the applications that use the Barnes-Hut and fast multipole methods, simple yet effective partitioning techniques can be developed by exploiting some key insights into both the methods and the classical problems that they solve. Using a novel partitioning technique, even relatively small problems achieve 45-fold speedups on a 48-processor Stanford DASH machine (a cache-coherent, shared address space multiprocessor) and 118-fold speedups on a 128-processor simulated architecture. The very different characteristics of the radiosity application require a different partitioning/scheduling approach to be used for it; however, it too yields very good parallel performance.
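To put the quoted speedups in context, the short sketch below converts them to parallel efficiency (speedup divided by processor count). The function name `parallel_efficiency` and the dictionary layout are illustrative assumptions, not taken from the paper; only the four numbers come from the abstract.

```python
def parallel_efficiency(speedup: float, processors: int) -> float:
    """Return speedup / processors; 1.0 would mean ideal linear scaling."""
    if processors <= 0:
        raise ValueError("processors must be positive")
    return speedup / processors


# Figures quoted in the abstract: (speedup, processor count).
# The keys are descriptive labels chosen for this example.
reported = {
    "Stanford DASH": (45, 48),
    "simulated architecture": (118, 128),
}

efficiencies = {
    name: parallel_efficiency(speedup, procs)
    for name, (speedup, procs) in reported.items()
}

for name, eff in efficiencies.items():
    print(f"{name}: efficiency = {eff:.3f}")
```

Under this definition, both configurations come out above 90% efficiency, which is consistent with the abstract's description of "very good parallel performance".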
