An Approach for Indirectly Adopting a Performance Portability Layer in Large Legacy Codes

Diversity among supported architectures in current and emerging high performance computing systems, including those for exascale, makes portable codebases desirable. Portability of a codebase can be improved by using a performance portability layer that provides access to multiple underlying programming models through a single interface. Direct adoption of a performance portability layer, however, poses challenges for large pre-existing software frameworks that may need to preserve legacy code and/or adopt other programming models in the future. This paper describes an approach for indirect adoption that introduces a framework-specific portability layer between the application developer and the adopted performance portability layer to help improve legacy code support and long-term portability for future architectures and programming models. This intermediate layer uses loop-level, application-level, and build-level components to ease adoption of a performance portability layer in large legacy codebases. Results are shown for two challenging case studies that use this approach to make portable use of OpenMP and CUDA via Kokkos in an asynchronous many-task runtime system, Uintah. These results show performance improvements of up to 2.7x when refactoring for portability and up to 2.6x when using a node more efficiently. Good strong scaling to 442,368 threads across 1,728 Knights Landing processors is also shown using MPI+Kokkos at scale.
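As a rough illustration of the loop-level idea, the sketch below places a framework-owned `parallel_for` between application code and whatever backend runs the loop. Everything in it is hypothetical and not taken from Uintah or Kokkos: the `fw` namespace, the `SerialTag` and `ChunkedTag` types, and the `scale` function are assumed names, and both backends are plain serial stand-ins. In a real intermediate layer, one overload would presumably forward to the adopted portability layer (for example `Kokkos::parallel_for`) while another keeps the legacy loop.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

namespace fw {  // hypothetical framework-specific portability layer

struct SerialTag {};   // stand-in for the preserved legacy loop path
struct ChunkedTag {};  // stand-in for a backend reached through the adopted layer

// Legacy path: a plain loop over [begin, end).
template <typename Functor>
void parallel_for(SerialTag, std::size_t begin, std::size_t end, const Functor& f) {
  assert(begin <= end);
  for (std::size_t i = begin; i < end; ++i) f(i);
}

// Alternative path: visits the same indices in fixed-size blocks.
// A real layer would forward to the adopted portability layer here.
template <typename Functor>
void parallel_for(ChunkedTag, std::size_t begin, std::size_t end, const Functor& f) {
  assert(begin <= end);
  const std::size_t chunk = 2;  // arbitrary block size for illustration
  for (std::size_t lo = begin; lo < end; lo += chunk) {
    const std::size_t hi = std::min(end, lo + chunk);
    for (std::size_t i = lo; i < hi; ++i) f(i);
  }
}

}  // namespace fw

// Application code sees only fw::parallel_for, so the backend can change
// without touching the loop body.
template <typename ExecTag>
std::vector<double> scale(ExecTag tag, const std::vector<double>& in, double a) {
  std::vector<double> out(in.size());
  fw::parallel_for(tag, 0, in.size(), [&](std::size_t i) { out[i] = a * in[i]; });
  return out;
}
```

Calling `scale(fw::SerialTag{}, data, 2.0)` and `scale(fw::ChunkedTag{}, data, 2.0)` runs the same loop body through either path. The design point is that supporting a new programming model would mean adding an overload inside the framework layer rather than editing every application loop.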
