Paper Information - Abstraction Layer For Standardizing APIs of Task-Based Engines

We introduce AL4SAN, a lightweight library that abstracts the APIs of task-based runtime engines. AL4SAN unifies the expression of tasks and their data dependencies. It supports various dynamic runtime systems relying on compiler technology and user-defined APIs. It enables a single application to employ different runtimes and their respective scheduling components, while keeping the user oblivious to the underlying hardware configuration. AL4SAN exposes common front-end APIs and connects to different back-end runtimes. Performance and overhead assessments are reported on various shared- and distributed-memory systems, possibly equipped with hardware accelerators. A range of workloads, from compute-bound to memory-bound regimes, serve as proxies for current scientific applications. The low overhead (less than 10 percent) observed across these workloads makes AL4SAN suitable for fast development of task-based numerical algorithms. More interestingly, AL4SAN enables runtime interoperability by switching between runtime systems during execution. Blending runtime systems yields a twofold speedup on a task-based generalized symmetric eigenvalue solver, relative to state-of-the-art implementations. The ultimate goal of AL4SAN is not to create a new runtime, but to strengthen the co-design of existing runtimes and applications, while facilitating user productivity and code portability. The code of AL4SAN is freely available at https://github.com/ecrc/al4san, with extensions in progress.
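The idea of one front end over interchangeable back ends can be illustrated with a minimal sketch. It is written in Python for brevity and is not AL4SAN's actual API: every name below (`Handle`, `Runtime`, `insert_task`, `switch`, `SerialBackend`, `ThreadBackend`) is hypothetical. The front end infers dependencies from declared access modes, and the back end only decides how tasks are executed.

```python
from concurrent.futures import ThreadPoolExecutor

READ, WRITE = "R", "W"  # hypothetical access modes declared per argument


class Handle:
    """Wraps a piece of data and tracks the tasks that last touched it."""
    def __init__(self, value):
        self.value = value
        self.last_writer = None  # future of the last writing task, if any
        self.readers = []        # futures of tasks reading since that write


class SerialBackend:
    """Hypothetical back end: executes every task at submission time."""
    def submit(self, fn, deps):
        fn()  # earlier tasks already ran, so dependencies are satisfied
        return None

    def wait_all(self):
        pass


class ThreadBackend:
    """Hypothetical back end: executes tasks on a thread pool."""
    def __init__(self, workers=4):
        self.pool = ThreadPoolExecutor(workers)
        self.futures = []

    def submit(self, fn, deps):
        deps = [d for d in deps if d is not None]

        def run():
            for d in deps:  # block until every dependency has finished
                d.result()
            fn()

        fut = self.pool.submit(run)
        self.futures.append(fut)
        return fut

    def wait_all(self):
        for f in self.futures:
            f.result()
        self.futures.clear()


class Runtime:
    """Common front end: same task-insertion call for every back end."""
    def __init__(self, backend):
        self.backend = backend

    def switch(self, backend):
        self.backend.wait_all()  # drain pending tasks before switching
        self.backend = backend

    def insert_task(self, fn, *args):
        deps = []
        for h, mode in args:
            if h.last_writer is not None:
                deps.append(h.last_writer)  # read/write after write
            if mode == WRITE:
                deps.extend(h.readers)      # write after read
        handles = [h for h, _ in args]
        fut = self.backend.submit(lambda: fn(*handles), deps)
        for h, mode in args:
            if mode == WRITE:
                h.last_writer, h.readers = fut, []
            else:
                h.readers.append(fut)

    def wait_all(self):
        self.backend.wait_all()


def scale(x):
    x.value = [2 * v for v in x.value]


def add(x, y):
    y.value = [a + b for a, b in zip(x.value, y.value)]


def run(rt):
    a, b = Handle([1, 2, 3]), Handle([10, 20, 30])
    rt.insert_task(scale, (a, WRITE))
    rt.insert_task(add, (a, READ), (b, WRITE))
    rt.switch(SerialBackend())  # change back end in the middle of the run
    rt.insert_task(scale, (b, WRITE))
    rt.wait_all()
    return b.value
```

Calling `run(Runtime(ThreadBackend()))` and `run(Runtime(SerialBackend()))` executes the same task sequence on either back end, with a switch partway through. In this sketch the switch simply waits for outstanding tasks before handing over, which is one simple way to keep data consistent across back ends and is an assumption of the example rather than a description of AL4SAN.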
