Hierarchical N-Body Methods on Shared Address Space Multiprocessors

We examine the parallelization issues in, and architectural implications of, the two dominant adaptive hierarchical N-body methods: the Barnes-Hut method and the Fast Multipole Method. We show that excellent parallel performance can be obtained on cache-coherent shared address space multiprocessors by demonstrating performance on three such machines: the Stanford DASH, the Kendall Square Research KSR-1, and the Silicon Graphics Challenge. Even on machines whose main memory is physically distributed among processing nodes and whose memory access costs are highly nonuniform, these speedups are obtained without any attention to where memory is allocated on the machine (a simple round-robin page allocation scheme is used). We show that the reason for this good performance is the high degree of temporal locality the applications afford, together with the fact that their working sets are small (and scale slowly), so that caching shared data automatically in hardware exploits this locality very effectively. Even if appropriate data distribution in main memory were free, it would not help very much. Finally, we address a potential bottleneck in scaling the parallelism to large machines: the fraction of time spent building the tree used by hierarchical N-body methods.
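To make the structure of these methods concrete, the following is a minimal, sequential 2-D sketch of the Barnes-Hut algorithm: bodies are inserted into a quadtree, internal cells accumulate total mass and center of mass, and a cell is treated as a single pseudo-particle when its size over its distance falls below an opening parameter theta. This is purely illustrative (the codes studied in the paper are 3-D and parallel); all names and the simple softening term `eps` are assumptions of this sketch, not the paper's implementation.

```python
import math

class Node:
    """One quadtree cell, covering a square of half-width `half` around (cx, cy)."""
    def __init__(self, cx, cy, half):
        self.cx, self.cy, self.half = cx, cy, half
        self.mass = 0.0
        self.mx = self.my = 0.0        # mass-weighted position sums
        self.body = None               # (x, y, m) when a leaf holds one body
        self.children = None           # four subcells once subdivided

    def insert(self, x, y, m):
        if self.children is None and self.body is None and self.mass == 0.0:
            self.body = (x, y, m)      # empty leaf: store the body here
        else:
            if self.children is None:
                self._subdivide()      # occupied leaf: push old body down
            self._child(x, y).insert(x, y, m)
        self.mass += m
        self.mx += m * x
        self.my += m * y

    def _subdivide(self):
        h = self.half / 2.0
        self.children = [Node(self.cx + dx * h, self.cy + dy * h, h)
                         for dx in (-1, 1) for dy in (-1, 1)]
        bx, by, bm = self.body
        self.body = None
        self._child(bx, by).insert(bx, by, bm)

    def _child(self, x, y):
        return self.children[(0 if x < self.cx else 2) + (0 if y < self.cy else 1)]

def accel(node, x, y, theta=0.5, eps=1e-6):
    """Acceleration on a unit-mass point at (x, y) due to all bodies in `node`."""
    if node.mass == 0.0:
        return (0.0, 0.0)
    dx = node.mx / node.mass - x
    dy = node.my / node.mass - y
    d = math.sqrt(dx * dx + dy * dy) + eps
    # Leaf, or cell far enough away: treat as one pseudo-particle.
    if node.children is None or (2.0 * node.half) / d < theta:
        f = node.mass / (d * d * d)
        return (f * dx, f * dy)
    ax = ay = 0.0
    for c in node.children:            # otherwise, open the cell and recurse
        cax, cay = accel(c, x, y, theta, eps)
        ax += cax
        ay += cay
    return (ax, ay)

if __name__ == "__main__":
    root = Node(0.0, 0.0, 1.0)         # domain [-1, 1] x [-1, 1]
    for bx, by, bm in [(-0.5, -0.5, 1.0), (0.5, 0.5, 1.0), (0.5, -0.25, 2.0)]:
        root.insert(bx, by, bm)
    print(accel(root, -0.5, -0.5))
```

In the parallel setting discussed in the abstract, each processor traverses this shared tree for its own subset of bodies; the high reuse of upper-level cells across nearby bodies is the temporal locality that hardware caching exploits.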