Data Parallel Algorithmic Skeletons with Accelerator Support

Hardware accelerators such as GPUs and the Intel Xeon Phi comprise hundreds or thousands of cores on a single chip and promise high performance. They are widely used to speed up highly parallel applications. However, their divergent architectures confront programmers with divergent programming paradigms, as well as with low-level concepts of parallel programming that make development cumbersome. To assist programmers in developing parallel applications, algorithmic skeletons have been proposed. They encapsulate well-defined, frequently recurring parallel programming patterns, thereby shielding programmers from low-level aspects of parallel programming. The main contribution of this paper is a comparison of two skeleton library implementations, one in C++ and one in Java, in terms of library design and programmability. In addition, we evaluate the performance of both implementations with four benchmark applications on two test systems, a GPU cluster and a Xeon Phi system. The two implementations achieve comparable performance, with a slight advantage for the C++ implementation; Xeon Phi performance lies between CPU and GPU performance.
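To illustrate the idea of a data-parallel skeleton, the following is a minimal C++ sketch, not the actual API of the library compared in the paper: the class name DArray and the map member function are hypothetical. The point is that the user supplies only the per-element function, while the skeleton owns iteration, distribution, and, in a real library, the dispatch to OpenMP, MPI, CUDA, or Xeon Phi backends.

```cpp
#include <algorithm>
#include <cstddef>
#include <iostream>
#include <vector>

// Hypothetical distributed array offering a "map" skeleton. In a real skeleton
// library this class would hide data distribution across cluster nodes and
// accelerators; here it is a plain sequential stand-in for illustration.
template <typename T>
class DArray {
public:
    explicit DArray(std::size_t n, T init = T{}) : data_(n, init) {}

    // Apply a user-supplied function to every element. A production skeleton
    // would execute this in parallel on the selected backend; std::transform
    // merely models the element-wise semantics.
    template <typename F>
    void map(F f) {
        std::transform(data_.begin(), data_.end(), data_.begin(), f);
    }

    T& operator[](std::size_t i) { return data_[i]; }

private:
    std::vector<T> data_;
};

int main() {
    DArray<double> a(8, 2.0);
    a.map([](double x) { return x * x; });  // square every element
    std::cout << a[3] << '\n';              // prints 4
    return 0;
}
```

The user code stays free of threads, kernels, and communication calls, which is precisely the programmability benefit the paper evaluates for the C++ and Java libraries.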
