Abstraction Layer For Standardizing APIs of Task-Based Engines

We introduce AL4SAN, a lightweight library for abstracting the APIs of task-based runtime engines. AL4SAN unifies the expression of tasks and their data dependencies. It supports various dynamic runtime systems relying on compiler technology and user-defined APIs. It enables a single application to employ different runtimes and their respective scheduling components, while keeping the user oblivious to the underlying hardware configuration. AL4SAN exposes common front-end APIs and connects to different back-end runtimes. We report performance and overhead assessments on various shared- and distributed-memory systems, possibly equipped with hardware accelerators. A range of workloads, from compute-bound to memory-bound regimes, is employed as a proxy for current scientific applications. The low overhead (less than 10 percent) observed across these workloads makes AL4SAN suitable for fast development of task-based numerical algorithms. More interestingly, AL4SAN enables runtime interoperability by allowing an application to switch runtime systems during execution. Blending runtime systems yields a twofold speedup on a task-based generalized symmetric eigenvalue solver, relative to state-of-the-art implementations. The ultimate goal of AL4SAN is not to create a new runtime, but to strengthen the co-design of existing runtimes and applications, while facilitating user productivity and code portability. The code of AL4SAN is freely available at https://github.com/ecrc/al4san, with extensions in progress.
