With the advent of complex modern architectures, the low-level
paradigms long considered sufficient to build High Performance Computing (HPC)
numerical codes have reached their limits. The need to achieve efficiency
and ensure portability while preserving programming tractability on such
hardware has prompted the HPC community to design new, higher-level paradigms.
The successful ports of fully-featured numerical libraries to several
recent runtime system proposals have indeed shown the benefit of
task-based parallelism models in terms of performance portability on
complex platforms. However, the common weakness of these projects is that
they deeply tie applications to specific, expert-only runtime system APIs. The
\omp specification, which aims to provide a common parallel
programming model for shared-memory platforms, appears to be a good
candidate to address this issue thanks to the task-based
constructs introduced in its revision 4.0.
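As a minimal illustrative sketch (not taken from \scalfmm; the
\texttt{compute} and \texttt{use} routines are hypothetical), the
\texttt{depend} clause introduced in \omp 4.0 lets the runtime derive
task ordering from declared data accesses:
\begin{verbatim}
int x = 0;
#pragma omp parallel
#pragma omp single
{
    /* producer task: writes x */
    #pragma omp task depend(out: x)
    x = compute();
    /* consumer task: scheduled only after the producer completes */
    #pragma omp task depend(in: x)
    use(x);
}
\end{verbatim}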
The goal of this paper is to assess the effectiveness and limits of
this support for designing a high-performance numerical library. We
illustrate our discussion with the \scalfmm library, which implements
state-of-the-art fast multipole method (FMM) algorithms and which we
have deeply redesigned around the most advanced
features provided by \omp 4. We show that \omp 4 allows for
significant performance improvements over previous \omp revisions on
recent multicore processors. We furthermore propose extensions to the
\omp 4 standard and show how they can enhance FMM performance. To
validate our proposal, we have implemented these extensions within the
\klanglong source-to-source compiler that translates \omp directives into
calls to the \starpu task-based runtime system. This study shows that
we can take advantage of the advanced capabilities of a fully-featured
runtime system without resorting to a specific, native runtime port,
hence bridging the gap between the \omp standard and the very high
performance so far reserved for expert-only runtime system
APIs.