Bridging the Gap Between OpenMP and Task-Based Runtime Systems for the Fast Multipole Method

With the advent of complex modern architectures, the low-level paradigms long considered sufficient for building High Performance Computing (HPC) numerical codes have reached their limits. The need to achieve efficiency and portability while keeping programming tractable on such hardware has prompted the HPC community to design new, higher-level paradigms that rely on runtime systems to maintain performance. However, a common weakness of these projects is that they deeply tie applications to specific, expert-only runtime system APIs. The OpenMP specification, which aims to provide common parallel programming means for shared-memory platforms, appears to be a good candidate to address this issue thanks to the task-based constructs introduced in its revision 4.0. The goal of this paper is to assess the effectiveness and limits of this support for designing a high-performance numerical library, ScalFMM, which implements the fast multipole method (FMM) and which we have deeply redesigned around the most advanced features of OpenMP 4. We show that OpenMP 4 allows for significant performance improvements over previous OpenMP revisions on recent multicore processors, and that extensions to the 4.0 standard further improve performance substantially, bridging the gap with the very high performance that was so far reserved for expert-only runtime system APIs.
