A massively parallel adaptive fast-multipole method on heterogeneous architectures

We present new scalable algorithms and a new implementation of our kernel-independent fast multipole method (Ying et al. ACM/IEEE SC '03), in which we employ both distributed memory parallelism (via MPI) and shared memory/streaming parallelism (via GPU acceleration) to rapidly evaluate two-body non-oscillatory potentials. On traditional CPU-only systems, our implementation scales well up to 30 billion unknowns on 65K cores (AMD/CRAY-based Kraken system at NSF/NICS) for highly non-uniform point distributions. On GPU-enabled systems, we achieve a 30x speedup over a comparable CPU-only implementation for problems of up to 256 million points on 256 GPUs (Lincoln at NSF/NCSA). We achieve scalability to such extreme core counts by adopting a new approach to scalable MPI-based tree construction and partitioning, and a new reduction algorithm for the evaluation phase. For the sub-components of the evaluation phase (the direct and approximate interactions, the target evaluation, and the source-to-multipole translations), we use NVIDIA's CUDA framework for GPU acceleration to achieve excellent performance. Doing so requires carefully constructed data structure transformations, which we describe in the paper and whose cost we show to be minor. Taken together, these components show promise for ultrascalable FMM in the petascale era and beyond.
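
For orientation, the direct interactions mentioned above amount to an explicit pairwise sum, which is the quadratic-cost baseline that a fast multipole method avoids for well-separated points. The following is a minimal sketch only, not the paper's implementation: it assumes the 3D Laplace kernel 1/(4*pi*r) as one example of a non-oscillatory two-body potential, and the function name `direct_potential` and its argument layout are hypothetical.

```python
import math


def direct_potential(targets, sources, charges):
    """Direct O(N*M) sum of an assumed 3D Laplace kernel 1/(4*pi*r).

    targets, sources: lists of (x, y, z) tuples.
    charges: one source strength per source point.
    Returns one potential value per target point.
    """
    if len(sources) != len(charges):
        raise ValueError("sources and charges must have the same length")

    potentials = []
    for tx, ty, tz in targets:
        total = 0.0
        for (sx, sy, sz), q in zip(sources, charges):
            r = math.sqrt((tx - sx) ** 2 + (ty - sy) ** 2 + (tz - sz) ** 2)
            if r > 0.0:  # skip coincident points (self-interaction)
                total += q / (4.0 * math.pi * r)
        potentials.append(total)
    return potentials


if __name__ == "__main__":
    # One target at the origin, two sources at distances 1 and 2.
    print(direct_potential([(0.0, 0.0, 0.0)],
                           [(1.0, 0.0, 0.0), (0.0, 2.0, 0.0)],
                           [1.0, 2.0]))
```

In an FMM, a sum of this form is applied only between nearby boxes of the tree, while far-field contributions are handled through the approximate interactions; each target's sum is independent of the others, which is one reason this step maps naturally onto GPU threads.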
