An investigation of the performance portability of OpenCL

This paper reports on the development of an MPI/OpenCL implementation of LU, an application-level benchmark from the NAS Parallel Benchmark Suite. An account of the design decisions made during the development of this code is presented, demonstrating the importance of memory arrangement and of work-item/work-group distribution strategies when applications are deployed across different device types. The resulting platform-agnostic, single-source application is benchmarked on a number of different architectures, and is shown to be 1.3-1.5x slower than native FORTRAN 77 or CUDA implementations on a single node, and 1.3-3.1x slower on multiple nodes. We also explore the potential performance gains of OpenCL's device fissioning capability, demonstrating up to a 3x speed-up over our original OpenCL implementation.
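
For context, the sketch below is not taken from the paper; it is a minimal illustration of the device fission capability the abstract refers to, using the OpenCL 1.2 clCreateSubDevices call to partition a CPU device into sub-devices with a fixed number of compute units each. The partition size of 4 and the choice of a CPU device are illustrative assumptions, and an implementation targeting OpenCL 1.1 would instead use the cl_ext_device_fission extension (clCreateSubDevicesEXT).

```c
/*
 * Minimal sketch (not from the paper): partition an OpenCL CPU device into
 * sub-devices via the OpenCL 1.2 clCreateSubDevices API. Each sub-device can
 * then be given its own context and command queue, confining work to a
 * subset of cores (e.g. one sub-device per socket or NUMA node).
 */
#define CL_TARGET_OPENCL_VERSION 120
#include <stdio.h>
#include <stdlib.h>
#ifdef __APPLE__
#include <OpenCL/cl.h>
#else
#include <CL/cl.h>
#endif

int main(void)
{
    cl_platform_id platform;
    cl_device_id device;
    cl_int err;

    /* Select the first platform and its first CPU device. */
    if (clGetPlatformIDs(1, &platform, NULL) != CL_SUCCESS) {
        fprintf(stderr, "No OpenCL platform found\n");
        return 1;
    }
    if (clGetDeviceIDs(platform, CL_DEVICE_TYPE_CPU, 1, &device, NULL) != CL_SUCCESS) {
        fprintf(stderr, "No CPU device found\n");
        return 1;
    }

    cl_uint max_units = 0;
    clGetDeviceInfo(device, CL_DEVICE_MAX_COMPUTE_UNITS,
                    sizeof(max_units), &max_units, NULL);

    /* Partition the device into sub-devices of 4 compute units each
     * (the value 4 is an illustrative assumption). */
    const cl_device_partition_property props[] = {
        CL_DEVICE_PARTITION_EQUALLY, 4, 0
    };

    /* First query how many sub-devices this partitioning would yield. */
    cl_uint num_sub = 0;
    err = clCreateSubDevices(device, props, 0, NULL, &num_sub);
    if (err != CL_SUCCESS || num_sub == 0) {
        fprintf(stderr, "Device fission not supported on this device\n");
        return 1;
    }

    cl_device_id *sub_devices = malloc(num_sub * sizeof(cl_device_id));
    clCreateSubDevices(device, props, num_sub, sub_devices, NULL);
    printf("Partitioned %u compute units into %u sub-devices\n",
           max_units, num_sub);

    /* Each sub-device behaves as an independent device: create a context
     * per sub-device so kernels can be pinned to that core subset. */
    for (cl_uint i = 0; i < num_sub; ++i) {
        cl_context ctx = clCreateContext(NULL, 1, &sub_devices[i],
                                         NULL, NULL, &err);
        if (err == CL_SUCCESS)
            clReleaseContext(ctx);
        clReleaseDevice(sub_devices[i]);
    }

    free(sub_devices);
    return 0;
}
```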
