On Applying Performance Portability Metrics

As we prepare for further technological advancement in supercomputing, the diversity of hardware architectures and parallel programming languages has increased to new levels. At the same time, extracting performance from so many architectures has become even more difficult. In this context, portable languages capable of generating executable code for multiple architectures have become a recurrent research target. We port a set of seven parallel benchmarks from the SPEC ACCEL suite and a wave propagation code to one such portable language: the Kokkos C++ programming library. We then apply a known performance portability metric to the original OpenACC versions and the Kokkos versions of the eight codes across a variety of hardware platforms and problem sizes. We observe that the portability metric is sensitive to the problem size. To remedy this deficiency, we propose a novel metric for performance portability, apply the proposed metric to the eight codes, and discuss the results.
