Fast multipole methods on graphics processors

The fast multipole method (FMM) allows the rapid approximate evaluation of sums of radial basis functions. For a specified accuracy ε, the method scales as O(N) in both time and memory, compared with O(N^2) for the direct method, which allows larger problems to be solved with given resources. Graphics processing units (GPUs) are increasingly viewed as data-parallel compute coprocessors that can provide significant computational performance at low cost. We describe acceleration of the FMM using the data-parallel GPU architecture. The FMM has a complex hierarchical (adaptive) structure, which is not easily implemented on data-parallel processors. We describe strategies for parallelizing all components of the FMM, develop a model to explain the performance of the algorithm on the GPU architecture, and determine optimal settings for the FMM on the GPU. These optimal settings differ from those on conventional CPUs. Some innovations in the FMM algorithm, including the use of modified stencils, real polynomial basis functions for the Laplace kernel, and decompositions of the translation operators, are also described. We obtain speedups of the Laplace kernel FMM on a single NVIDIA GeForce 8800 GTX GPU by factors of 30 to 60 relative to a serial CPU FMM implementation. For a problem with a million sources, the summations involved are performed in approximately one second. This performance is equivalent to solving the same problem at a rate of 43 teraflops using straightforward summation.
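
For orientation, the direct O(N^2) summation that serves as the baseline above can be sketched as a plain double loop over targets and sources. The sketch below is illustrative only and is not the implementation described here: it assumes the 3D Laplace kernel in the form q/r (constant factors omitted), and the function names `direct_laplace_sum` and `implied_flops_per_pair` are hypothetical. The second function simply divides a quoted "equivalent" flop rate by the number of pairwise interactions per second, which shows how such a figure relates to N^2.

```python
import math


def direct_laplace_sum(sources, charges, targets):
    """Brute-force evaluation of phi(y) = sum_i q_i / |y - x_i|.

    Cost is O(len(sources) * len(targets)), i.e. O(N^2) when both sets
    have N points. The 1/r kernel (no 1/(4*pi) factor) is an assumption
    made for this sketch.
    """
    potentials = []
    for y in targets:
        total = 0.0
        for x, q in zip(sources, charges):
            r = math.dist(x, y)
            if r > 0.0:  # skip the singular term when a target coincides with a source
                total += q / r
        potentials.append(total)
    return potentials


def implied_flops_per_pair(n, seconds, equivalent_flop_rate):
    """Flops per pairwise interaction implied by an 'equivalent' direct-sum rate."""
    return equivalent_flop_rate * seconds / (n * n)


if __name__ == "__main__":
    phi = direct_laplace_sum(
        sources=[(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)],
        charges=[1.0, 2.0],
        targets=[(0.0, 0.0, 1.0)],
    )
    print(phi)
    # A million sources, about one second, 43 teraflops equivalent:
    print(implied_flops_per_pair(10**6, 1.0, 43e12))
```

Under these assumptions, a million sources evaluated at a million targets means 10^12 pairwise interactions, so an equivalent rate of 43 teraflops over roughly one second corresponds to a few tens of floating-point operations per interaction.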
