Optimizing the multipole‐to‐local operator in the fast multipole method for graphical processing units

This paper presents a number of algorithms to run the fast multipole method (FMM) on NVIDIA CUDA-capable graphical processing units (GPUs) (NVIDIA Corporation, Santa Clara, CA, USA). The FMM is a class of methods to compute pairwise interactions between N particles for a given error tolerance and with a computational cost of O(N). The methods described in the paper are applicable to any FMM in which the multipole-to-local (M2L) operator is a dense matrix and the matrix is precomputed. This is the case, for example, in the black-box fast multipole method (bbFMM), a variant of the FMM that can handle a large class of kernels; the bbFMM is the example used in our benchmarks. In the FMM, two operators represent most of the computational cost, and an optimal implementation typically tries to balance them. One is the nearby interaction calculation (the direct sum calculation, line 29 in Listing 1), and the other is the M2L operation. We focus on the M2L operation. By combining multiple M2L operations and reordering the primitive loops of the M2L so that CUDA threads can reuse or share common data, our schemes reduce the movement of data in the GPU. Because memory bandwidth is the primary bottleneck of these methods, significant performance improvements are realized. Four M2L schemes are detailed and analyzed in the case of a uniform tree. The four schemes are tested and compared with an optimized, OpenMP-parallelized, multicore CPU code. We consider high- and low-precision calculations by varying the number of Chebyshev nodes used in the bbFMM. The accuracy of the GPU codes is found to be satisfactory, and the achieved performance exceeds 200 Gflop/s on one NVIDIA Tesla C1060 GPU. This was compared against two quad-core Intel Xeon E5345 processors (Intel Corporation, Santa Clara, CA, USA) running at 2.33 GHz, for a combined peak performance of 149 Gflop/s in single precision. For the low FMM accuracy case, the observed performance of the CPU code was 37 Gflop/s, whereas for the high FMM accuracy case, the performance was about 8.5 Gflop/s, most likely because of a higher frequency of cache misses. We also present benchmarks on an NVIDIA C2050 GPU (a Fermi processor) in single and double precision. Copyright © 2011 John Wiley & Sons, Ltd.
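
The abstract describes the M2L operation as a dense precomputed matrix applied to the multipole coefficients of each source cell, with performance hinging on how often those coefficients are reused once loaded. As a rough illustration of that data-reuse idea, and not a reproduction of any of the paper's four schemes, the following is a minimal CUDA sketch of a batched M2L pass over a uniform tree. The array layout, the kernel and array names (m2l_batched, multipoles, ilist, and so on), the interaction-list encoding, and the choice K = 64 coefficients per cell (n = 4 Chebyshev nodes per dimension, the low-accuracy bbFMM case) are all illustrative assumptions.

// Minimal sketch (not the paper's schemes): one thread block accumulates the
// local expansion of one target cell from dense K x K M2L products over its
// interaction list. Launch as m2l_batched<<<ncells, K>>>(...). All names and
// layouts here are hypothetical.
#define K 64  // n^3 Chebyshev coefficients per cell, n = 4 (low accuracy)

__global__ void m2l_batched(const float* __restrict__ multipoles, // [ncells][K]
                            float*       __restrict__ locals,     // [ncells][K]
                            const float* __restrict__ m2l,        // [316][K][K], precomputed
                            const int*   __restrict__ ilist,      // [ncells][maxIlist] source cell ids
                            const int*   __restrict__ itype,      // [ncells][maxIlist] transfer-vector ids
                            const int*   __restrict__ icount,     // interaction-list lengths
                            int maxIlist)
{
    const int cell = blockIdx.x;   // one block per target cell
    const int row  = threadIdx.x;  // one thread per output coefficient (blockDim.x == K)

    __shared__ float src[K];       // source multipole vector, shared by all K threads
    float acc = 0.0f;

    for (int j = 0; j < icount[cell]; ++j) {
        const int s = ilist[cell * maxIlist + j];
        const int t = itype[cell * maxIlist + j];

        __syncthreads();           // protect src from readers of the previous iteration
        src[row] = multipoles[s * K + row];  // cooperative load: one global read per element
        __syncthreads();

        const float* A = m2l + t * K * K;    // dense operator for this transfer vector
        for (int c = 0; c < K; ++c)
            acc += A[row * K + c] * src[c];  // row of A times the shared source vector
    }
    locals[cell * K + row] += acc;
}

Each source vector is read from global memory once per interaction and then reused K times out of shared memory, which is the kind of reduction in data movement the abstract attributes to combining M2L operations and reordering loops; a real implementation along the paper's lines would also stage the precomputed operators and batch work across cells more aggressively.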
