Performance portable GPU code generation for matrix multiplication

Parallel accelerators such as GPUs are notoriously hard to program; exploiting their full performance potential is a job best left to ninja programmers. High-level programming languages coupled with optimizing compilers have been proposed to address this issue. However, they rely on device-specific heuristics or hard-coded library implementations to achieve good performance, resulting in non-portable solutions that must be re-optimized for every new device. Achieving performance portability is the holy grail of high-performance computing and has so far remained an open problem, even for well-studied applications such as matrix multiplication. We argue that what is needed is a way to describe applications at a high level without committing to a particular implementation. To this end, we developed in previous work a functional data-parallel language that allows applications to be expressed in a device-neutral way. We use a set of well-defined rewrite rules to automatically transform programs into semantically equivalent device-specific forms, from which OpenCL code is generated. In this paper, we demonstrate how this approach produces high-performance OpenCL code for GPUs with a well-studied, well-understood application: matrix multiplication. Starting from a single high-level program, our compiler automatically generates highly optimized and specialized implementations. We group simple rewrite rules into more complex macro-rules, each describing a well-known optimization such as tiling or register blocking in a composable way. Using an exploration strategy, our compiler automatically generates 50,000 OpenCL kernels, each providing a differently optimized -- but provably correct -- implementation of matrix multiplication. The automatically generated code offers competitive performance compared to the manually tuned MAGMA library implementations of matrix multiplication on Nvidia GPUs, and even outperforms AMD's clBLAS library.
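The kind of semantics-preserving rewrite the abstract refers to can be sketched with the classic split-join identity: `map f = join . map (map f) . split n`. The following Python snippet is an illustrative sketch only (it is not the paper's compiler, and the helper names `split`, `join`, and `map_list` are chosen here for exposition); it shows how rewriting a flat map into a nested one exposes a two-level, tile-like parallelism structure while provably computing the same result.

```python
def split(n, xs):
    """Partition xs into chunks of length n (assumes len(xs) % n == 0)."""
    return [xs[i:i + n] for i in range(0, len(xs), n)]

def join(xss):
    """Flatten one level of nesting -- the inverse of split."""
    return [x for chunk in xss for x in chunk]

def map_list(f, xs):
    return [f(x) for x in xs]

def square(x):
    return x * x

xs = list(range(8))

# Original high-level program: a single flat map.
original = map_list(square, xs)

# Semantically equivalent rewritten form: tiles of size 4, each mapped
# independently -- e.g. tiles could map to workgroups, elements to work-items.
rewritten = join(map_list(lambda tile: map_list(square, tile), split(4, xs)))

assert original == rewritten  # the rewrite preserves semantics
```

Macro-rules in the style described above would compose several such identities (e.g. splitting along both matrix dimensions) to express tiling as one reusable transformation.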
