FLASH 1.0: A Software Framework for Rapid Parallel Deployment and Enhancing Host Code Portability in Heterogeneous Computing

In this paper, we present FLASH 1.0, a C++-based software framework for rapid parallel deployment and enhancing host code portability in heterogeneous computing. FLASH takes a novel approach to describing kernels and dynamically dispatching them in a hardware-agnostic manner. FLASH features truly hardware-agnostic frontend interfaces, which not only unify the compile-time control flow but also enforce a portability-optimized code organization that imposes a demarcation between computational (performance-critical) and functional (non-performance-critical) code, as well as the separation of hardware-specific and hardware-agnostic code in the host application. We use static code analysis to measure the hardware independence ratio of popular HPC applications and show that up to 99.72% code portability can be achieved with FLASH. Similarly, we measure the complexity of state-of-the-art portable programming models and show that a code reduction of up to 2.2x can be achieved for two common HPC kernels while maintaining 100% code portability, with a normalized framework overhead between 1% and 13% of the total kernel runtime. The code is available at https://github.com/PSCLab-ASU/FLASH.

CCS CONCEPTS: • Software and its engineering → Object oriented frameworks; Software as a service orchestration system.
