Data‐driven execution of fast multipole methods

Fast multipole methods (FMMs) have O(N) complexity, are compute bound, and require very little synchronization, which makes them well suited to next-generation supercomputers. Their most common application is accelerating N-body problems, but they can also be used to solve boundary integral equations. When the particle distribution is irregular and the tree structure is adaptive, load balancing becomes non-trivial. A common strategy for load balancing FMMs is to use the workload from the previous step as weights to statically repartition the next step. This paper discusses another approach, based on data-driven execution, to tackle this load balancing problem efficiently. The core idea is to break the most time-consuming stages of the FMM into smaller tasks. The algorithm can then be represented as a directed acyclic graph in which nodes represent tasks and edges represent the dependencies among them. The algorithm is executed by asynchronously scheduling the tasks with the QUeueing And Runtime for Kernels (QUARK) runtime environment, so that data dependencies are never violated and numerical correctness is preserved. This asynchronous scheduling results in out-of-order execution. The data-driven FMM execution outperforms the previous strategy and shows linear speedup on a quad-socket quad-core Intel Xeon system. Copyright © 2013 John Wiley & Sons, Ltd.
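
The task-graph idea in the abstract can be illustrated with a minimal sketch. It is not the paper's implementation and does not use QUARK: a Python thread pool stands in for the runtime, and `run_dag`, the toy two-leaf graph, and the task bodies are hypothetical. The stage names (P2M, M2M, M2L, L2L, L2P, P2P) follow the conventional FMM operator names, which the abstract does not list. The sketch only shows that a task is released as soon as all of its dependencies have finished, so independent tasks may complete out of order.

```python
from concurrent.futures import ThreadPoolExecutor, FIRST_COMPLETED, wait


def run_dag(tasks, deps, workers=4):
    """Run each callable in `tasks` only after every task in deps[name] is done.

    Returns the task names in completion order. Hypothetical helper for
    illustration; a real runtime such as QUARK infers dependencies from data.
    """
    # Unfinished parents of each task, and the reverse (parent -> children) map.
    remaining = {name: set(deps.get(name, ())) for name in tasks}
    dependents = {name: [] for name in tasks}
    for name, parents in remaining.items():
        for parent in parents:
            dependents[parent].append(name)

    order = []
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Start with every task that has no dependencies.
        running = {pool.submit(tasks[n]): n
                   for n, parents in remaining.items() if not parents}
        while running:
            done, _ = wait(list(running), return_when=FIRST_COMPLETED)
            for future in done:
                name = running.pop(future)
                future.result()  # re-raise any exception from the task
                order.append(name)
                # Release children whose last dependency just finished.
                for child in dependents[name]:
                    remaining[child].discard(name)
                    if not remaining[child]:
                        running[pool.submit(tasks[child])] = child

    if len(order) != len(tasks):
        raise ValueError("dependency graph has a cycle or an unknown task")
    return order


# Toy graph for a tree with two leaves, a and b (assumed structure).
deps = {
    "P2M_a": [], "P2M_b": [],            # particles -> leaf multipoles
    "M2M": ["P2M_a", "P2M_b"],           # upward pass
    "M2L": ["M2M"],                      # multipole -> local translation
    "L2L": ["M2L"],                      # downward pass
    "L2P_a": ["L2L"], "L2P_b": ["L2L"],  # local expansions -> particles
    "P2P_a": [], "P2P_b": [],            # near field, independent of the tree
}
tasks = {name: (lambda: None) for name in deps}  # placeholder task bodies

order = run_dag(tasks, deps)
print(order)
```

The position of the near-field `P2P` tasks in the printed order can change from run to run, while every task still appears after all of its dependencies. That is the out-of-order but dependency-respecting behaviour the abstract describes.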
