Towards High-Performance Code Generation for Multi-GPU Clusters Based on a Domain-Specific Language for Algorithmic Skeletons

In earlier work, we defined a domain-specific language (DSL) with the aim of providing an easy-to-use approach for programming multi-core and multi-GPU clusters. The DSL builds on algorithmic skeletons, which are well-known patterns for parallel programming, such as map and reduce. Depending on the chosen skeleton, a user-defined function is applied to a data structure in parallel, with the main advantage that the user does not have to worry about implementation details. So far, we had only implemented a generator for multi-core clusters. In this paper, we present and evaluate two prototype generators for multi-GPU clusters, one based on OpenACC and one based on CUDA. We have evaluated the approach with four benchmark applications. The results show that the generation approach leads to execution times that are on par with an alternative library implementation.
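To make the skeleton idea concrete, the following is a minimal C++ sketch of a map and a reduce skeleton applied to a user-defined function. It is not the DSL's syntax or the generated code: the names map_skeleton, reduce_skeleton, and sum_of_squares are hypothetical, and the loops run sequentially, whereas a generated backend would distribute them across cores, GPUs, or cluster nodes.

```cpp
#include <cstddef>
#include <functional>
#include <vector>

// Hypothetical map skeleton: applies the user function f to every element.
// Sequential here; a real backend would split the loop across cores or GPU threads.
template <typename T, typename F>
std::vector<T> map_skeleton(const std::vector<T>& data, F f) {
    std::vector<T> result;
    result.reserve(data.size());
    for (const T& x : data) {
        result.push_back(f(x));
    }
    return result;
}

// Hypothetical reduce skeleton: folds all elements with the user operator op.
// op should be associative so that a parallel backend may reorder the combination.
template <typename T, typename Op>
T reduce_skeleton(const std::vector<T>& data, T identity, Op op) {
    T acc = identity;
    for (const T& x : data) {
        acc = op(acc, x);
    }
    return acc;
}

// Usage: the user supplies only the element-wise function and the operator.
int sum_of_squares(const std::vector<int>& v) {
    std::vector<int> squared = map_skeleton(v, [](int x) { return x * x; });
    return reduce_skeleton(squared, 0, std::plus<int>());
}
```

The user code in sum_of_squares contains no threading, memory transfer, or communication logic. That separation is what allows a generator to target different platforms from the same skeleton calls.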
