Performance portable GPU code generation for matrix multiplication

Parallel accelerators such as GPUs are notoriously hard to program; exploiting their full performance potential is a job best left to ninja programmers. High-level programming languages coupled with optimizing compilers have been proposed to address this issue. However, they rely on device-specific heuristics or hard-coded library implementations to achieve good performance, resulting in non-portable solutions that must be re-optimized for every new device. Achieving performance portability is the holy grail of high-performance computing and has so far remained an open problem, even for well-studied applications such as matrix multiplication. We argue that what is needed is a way to describe applications at a high level without committing to a particular implementation. To this end, we developed in previous work a functional data-parallel language that allows applications to be expressed in a device-neutral way. We use a set of well-defined rewrite rules to automatically transform programs into semantically equivalent device-specific forms, from which OpenCL code is generated. In this paper, we demonstrate how this approach produces high-performance OpenCL code for GPUs with a well-studied, well-understood application: matrix multiplication. Starting from a single high-level program, our compiler automatically generates highly optimized and specialized implementations. We group simple rewrite rules into more complex macro-rules, each describing a well-known optimization such as tiling or register blocking in a composable way. Using an exploration strategy, our compiler automatically generates 50,000 OpenCL kernels, each providing a differently optimized -- but provably correct -- implementation of matrix multiplication. The automatically generated code offers competitive performance compared to the manually tuned MAGMA library implementations of matrix multiplication on Nvidia GPUs, and even outperforms AMD's clBLAS library.
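The kind of semantics-preserving rewrite the abstract refers to can be sketched with the classic split-join identity: `map f = join . map (map f) . split n`. The following Python snippet is an illustrative sketch only (it is not the paper's compiler, and the helper names `split`, `join`, and `map_list` are chosen here for exposition); it shows how rewriting a flat map into a nested one exposes a two-level, tile-like parallelism structure while provably computing the same result.

```python
def split(n, xs):
    """Partition xs into chunks of length n (assumes len(xs) % n == 0)."""
    return [xs[i:i + n] for i in range(0, len(xs), n)]

def join(xss):
    """Flatten one level of nesting -- the inverse of split."""
    return [x for chunk in xss for x in chunk]

def map_list(f, xs):
    return [f(x) for x in xs]

def square(x):
    return x * x

xs = list(range(8))

# Original high-level program: a single flat map.
original = map_list(square, xs)

# Semantically equivalent rewritten form: tiles of size 4, each mapped
# independently -- e.g. tiles could map to workgroups, elements to work-items.
rewritten = join(map_list(lambda tile: map_list(square, tile), split(4, xs)))

assert original == rewritten  # the rewrite preserves semantics
```

Macro-rules in the style described above would compose several such identities (e.g. splitting along both matrix dimensions) to express tiling as one reusable transformation.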
