Implications of a metric for performance portability

Abstract: The term “performance portability” has been used informally in computing to refer to a variety of notions that generally include: (1) the ability to run one application across multiple hardware platforms; and (2) achieving some notional level of performance on those platforms. However, there has been a noticeable lack of consensus on the precise meaning of the term, and authors’ conclusions regarding their success (or failure) in achieving performance portability have thus far been subjective. Comparisons of one approach to performance portability with another have generally relied on vague claims and verbose, qualitative explanations. This article presents a concise definition of performance portability and an associated metric that captures the performance and portability of an application across different platforms. Through retroactive application of this metric to previous research and a review of numerous programming languages, frameworks and libraries, we identify and suggest tractable approaches to code specialization that can aid the community in developing highly performance-portable applications with minimal impact on productivity.
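The metric discussed here builds on the authors’ earlier preprint “A Metric for Performance Portability” (2016). As a minimal sketch, assuming that formulation, the metric is the harmonic mean of an application’s per-platform performance efficiencies over a chosen platform set H, dropping to zero if the application fails to run on any platform in H. The function name and platform labels below are illustrative, not taken from the article.

```python
def performance_portability(efficiencies):
    """Harmonic mean of per-platform performance efficiencies.

    `efficiencies` maps each platform in the set H to the efficiency
    (a value in (0, 1]) achieved by the application when solving a
    given problem on that platform, or None if the application does
    not run there.  Returns 0.0 when the application is unsupported
    on any platform in H.
    """
    values = list(efficiencies.values())
    if not values or any(e is None or e <= 0.0 for e in values):
        return 0.0  # unsupported on some platform => zero by convention
    return len(values) / sum(1.0 / e for e in values)


# Illustrative application efficiencies (achieved / best-known performance)
# on three hypothetical platforms:
measurements = {"cpu": 0.90, "gpu": 0.70, "manycore": 0.50}
print(performance_portability(measurements))  # ~0.66
```

The harmonic mean is deliberately pessimistic: a single poorly performing or unsupported platform dominates the score, matching the intuition that performance portability is only as strong as the weakest platform in the set.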
