LIFT: A functional data-parallel IR for high-performance GPU code generation

Parallel patterns (e.g., map, reduce) have gained traction as an abstraction for targeting parallel accelerators and are a promising answer to the performance portability problem. However, compiling high-level programs into efficient low-level parallel code is challenging. Current approaches start from a high-level parallel IR and emit GPU code directly in one big step. They use fixed strategies to optimize and map parallelism, exploiting properties of a particular GPU generation, which leads to performance portability issues. We introduce the LIFT IR, a new data-parallel IR which encodes OpenCL-specific constructs as functional patterns. Our prior work has shown that this functional nature simplifies the exploration of optimizations and the mapping of parallelism from portable high-level programs using rewrite rules. This paper describes how LIFT IR programs are compiled into efficient OpenCL code. This is non-trivial, as many performance-sensitive details such as memory allocation, array accesses, and synchronization are not explicitly represented in the LIFT IR. We present techniques which overcome this challenge by exploiting the patterns' high-level semantics. Our evaluation shows that the LIFT IR is flexible enough to express GPU programs with complex optimizations, achieving performance on par with manually optimized code.
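To make the idea of functional patterns and rewrite rules concrete, the following is a minimal sketch in Python. It is not the LIFT IR or its API: the combinators `map_p`, `reduce_p`, `zip_p`, `split`, and `join` are hypothetical names chosen for illustration, and the code runs sequentially on lists instead of generating OpenCL. It shows a dot product written as a composition of patterns, and a semantics-preserving rewrite of `map` into a nested, tiled form, which is the kind of transformation that could later be mapped onto a GPU's thread hierarchy.

```python
from functools import reduce as fold
from operator import add

# Hypothetical pattern combinators for illustration only; not the LIFT API.

def map_p(f):
    """Pattern: apply f to every element of a list."""
    return lambda xs: [f(x) for x in xs]

def reduce_p(op, init):
    """Pattern: combine all elements with a binary operator."""
    return lambda xs: fold(op, xs, init)

def zip_p(xs, ys):
    """Pattern: pair up the elements of two lists."""
    return list(zip(xs, ys))

def split(n):
    """Pattern: cut a list into chunks of length n (last chunk may be shorter)."""
    return lambda xs: [xs[i:i + n] for i in range(0, len(xs), n)]

def join(xss):
    """Pattern: flatten a list of chunks back into one list."""
    return [x for xs in xss for x in xs]

def multiply_pair(p):
    return p[0] * p[1]

def dot(xs, ys):
    # High-level program: reduce(+) over map(*) over zip.
    return reduce_p(add, 0)(map_p(multiply_pair)(zip_p(xs, ys)))

def dot_tiled(xs, ys, n):
    # Rewritten program, using the rule
    #   map f  ==  join . map (map f) . split n
    # The outer and inner maps are now separate nesting levels.
    chunks = split(n)(zip_p(xs, ys))
    products = join(map_p(map_p(multiply_pair))(chunks))
    return reduce_p(add, 0)(products)
```

Both functions compute the same value for any chunk size `n >= 1`; only the structure of the computation differs, which is what makes such rewrites safe to explore automatically.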
