Assessing the performance portability of modern parallel programming models using TeaLeaf

In this work, we evaluate several emerging parallel programming models (Kokkos, RAJA, OpenACC, and OpenMP 4.0) against the mature CUDA and OpenCL APIs. Each model has been used to port TeaLeaf, a miniature proxy application (mini-app) from the Mantevo Project that solves the heat conduction equation. We find that the best performance is achieved with architecture-specific implementations, but that, in many cases, the performance-portable models solve the same problems within a 5% to 30% performance penalty. While the models expose varying levels of complexity to the developer, they all achieve reasonable performance with this application. As such, if this performance penalty is permissible for a given problem domain, we believe that productivity and development complexity become the major differentiators when choosing a modern parallel programming model to develop applications like TeaLeaf.
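To illustrate the kind of kernel each model must express, the sketch below shows a single explicit 5-point stencil update for the heat conduction equation written with Kokkos, with the equivalent OpenMP 4.0 offload directive noted in a comment. This is a hypothetical minimal example, not code from the TeaLeaf source; the grid dimensions, coefficient, view names, and kernel label are illustrative assumptions.

// Hypothetical sketch (not from the TeaLeaf source): one explicit
// 5-point stencil update for the heat conduction equation in Kokkos.
#include <Kokkos_Core.hpp>

int main(int argc, char* argv[]) {
  Kokkos::initialize(argc, argv);
  {
    const int nx = 256, ny = 256;   // assumed grid dimensions
    const double r = 0.1;           // assumed conduction coefficient (dt/dx^2 scaled)

    // Views allocate in the default execution space's memory
    // (host memory for OpenMP builds, device memory for CUDA builds).
    Kokkos::View<double**> u("u", ny, nx);
    Kokkos::View<double**> u_new("u_new", ny, nx);

    // Parallel 2D stencil update over the interior of the grid; the
    // same lambda body runs unchanged on CPU or GPU back ends.
    Kokkos::parallel_for("heat_step",
      Kokkos::MDRangePolicy<Kokkos::Rank<2>>({1, 1}, {ny - 1, nx - 1}),
      KOKKOS_LAMBDA(const int j, const int i) {
        u_new(j, i) = u(j, i)
                    + r * (u(j, i - 1) + u(j, i + 1)
                         + u(j - 1, i) + u(j + 1, i)
                         - 4.0 * u(j, i));
      });
    Kokkos::fence();

    // The equivalent OpenMP 4.0 offload version of this loop nest would
    // annotate a plain double loop over raw arrays with a directive such as:
    //   #pragma omp target teams distribute parallel for collapse(2)
    // with explicit map() clauses for the device data transfers.
  }
  Kokkos::finalize();
  return 0;
}

The point of the comparison is that the loop body itself is essentially identical across models; what differs is the surrounding machinery (views, lambdas, and policies in Kokkos versus directives and data-mapping clauses in OpenMP 4.0 and OpenACC), which is where the productivity and complexity trade-offs discussed above arise.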
