Asynchronous Task Scheduling of the Fast Multipole Method Using Various Runtime Systems

In this paper, we explore data-driven execution of the adaptive fast multipole method by asynchronously scheduling available computational tasks using Cilk, the C++11 standard thread and future libraries, the High Performance ParalleX (HPX-5) library, and OpenMP tasks. By comparing these implementations on various input data sets, this paper examines each runtime system's capability to spawn new tasks, the number of tasks it can manage, the performance impact of eager versus lazy thread creation for new tasks, and the effectiveness of the task scheduler, including its ability to recognize the critical path of the underlying algorithm.
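To make the distinction between eager and lazy thread creation concrete, the following is a minimal sketch using only the C++11 standard library. It is not the implementation evaluated in the paper: `tree_sum`, its `cutoff` parameter, and the recursive range sum are hypothetical stand-ins for spawning child tasks over a tree. With `std::launch::async` each child task is started eagerly on a new thread, while with `std::launch::deferred` the child runs lazily on the calling thread when `get()` is called.

```cpp
#include <cassert>
#include <cstddef>
#include <future>
#include <numeric>
#include <vector>

// Hypothetical stand-in for a tree of tasks: sum v[lo, hi) by recursive halving.
// `policy` selects eager (std::launch::async) or lazy (std::launch::deferred)
// creation of the child task; `cutoff` is an assumed leaf size, not a tuned value.
long long tree_sum(const std::vector<int>& v, std::size_t lo, std::size_t hi,
                   std::launch policy, std::size_t cutoff = 1024) {
    assert(lo <= hi && hi <= v.size());
    if (hi - lo <= cutoff) {
        // Leaf: do the work directly, without spawning anything.
        return std::accumulate(v.begin() + lo, v.begin() + hi, 0LL);
    }
    const std::size_t mid = lo + (hi - lo) / 2;
    // Spawn the left half as a child task under the chosen launch policy.
    std::future<long long> left = std::async(policy, [&v, lo, mid, policy, cutoff] {
        return tree_sum(v, lo, mid, policy, cutoff);
    });
    // Process the right half on the current thread in the meantime.
    const long long right = tree_sum(v, mid, hi, policy, cutoff);
    // With the deferred policy, the left half only starts executing here.
    return left.get() + right;
}
```

Calling `tree_sum(v, 0, v.size(), std::launch::async)` and `tree_sum(v, 0, v.size(), std::launch::deferred)` on the same vector returns the same sum; only where and when the child tasks run differs, which is the kind of trade-off the comparison above is concerned with.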
