FLASH 1.0: A Software Framework for Rapid Parallel Deployment and Enhancing Host Code Portability in Heterogeneous Computing

In this paper, we present FLASH 1.0, a C++-based software framework for rapid parallel deployment and enhancing host code portability in heterogeneous computing. FLASH takes a novel approach to describing kernels and dynamically dispatching them in a hardware-agnostic manner. FLASH features truly hardware-agnostic frontend interfaces, which not only unify the compile-time control flow but also enforce a portability-optimized code organization that imposes a demarcation between computational (performance-critical) and functional (non-performance-critical) code, as well as the separation of hardware-specific and hardware-agnostic code in the host application. We use static code analysis to measure the hardware independence ratio of popular HPC applications and show that up to 99.72% code portability can be achieved with FLASH. Similarly, we measure the complexity of state-of-the-art portable programming models and show that a code reduction of up to 2.2x can be achieved for two common HPC kernels while maintaining 100% code portability, with a normalized framework overhead between 1% and 13% of the total kernel runtime. The code is available at https://github.com/PSCLab-ASU/FLASH.

CCS CONCEPTS: • Software and its engineering → Object oriented frameworks; Software as a service orchestration system.
