Portably Improving Uintah’s Readiness for Exascale Systems Through the Use of Kokkos

Uncertainty and diversity in future HPC systems, including those for exascale, make portable codebases desirable. To ease future ports, the Uintah Computational Framework has adopted the Kokkos C++ Performance Portability Library. This paper describes infrastructure advancements and performance improvements achieved by using partitioning functionality recently added to Kokkos within Uintah’s MPI+Kokkos hybrid parallelism approach. Results are presented for two challenging calculations that have been refactored to support the Kokkos::OpenMP and Kokkos::Cuda back-ends. These results demonstrate performance improvements of up to (i) 2.66x when refactoring for portability, (ii) 81.59x when adding loop-level parallelism via Kokkos back-ends, and (iii) 2.63x when using a node more efficiently. Good strong-scaling characteristics to 442,368 threads across 1,728 Knights Landing processors are also shown. These improvements have been achieved with little added overhead (sub-millisecond, consuming up to 0.18% of per-timestep time). Lessons learned from Kokkos adoption and refactoring are also discussed.

Authors: John K. Holmen, Brad Peterson, Alan Humphrey, Daniel Sunderland, Oscar H. Díaz-Ibarra, Jeremy N. Thornock, Martin Berzins

Affiliations:
(a) Scientific Computing and Imaging Institute, University of Utah, Salt Lake City, UT 84112
(b) Sandia National Laboratories, PO Box 5800 / MS 1418, Albuquerque, NM 87175
(c) Institute for Clean and Secure Energy, University of Utah, Salt Lake City, UT 84112
