Patterns and Rewrite Rules for Systematic Code Generation (From High-Level Functional Patterns to High-Performance OpenCL Code)

Computing systems have become increasingly complex with the emergence of heterogeneous hardware combining multicore CPUs and GPUs. These parallel systems exhibit tremendous computational power at the cost of increased programming effort. This results in a tension between achieving performance and code portability. Code is either tuned using device-specific optimizations to achieve maximum performance or is written in a high-level language to achieve portability at the expense of performance. We propose a novel approach that offers high-level programming, code portability and high-performance. It is based on algorithmic pattern composition coupled with a powerful, yet simple, set of rewrite rules. This enables systematic transformation and optimization of a high-level program into a low-level hardware specific representation which leads to high performance code. We test our design in practice by describing a subset of the OpenCL programming model with low-level patterns and by implementing a compiler which generates high performance OpenCL code. Our experiments show that we can systematically derive high-performance device-specific implementations from simple high-level algorithmic expressions. The performance of the generated OpenCL code is on par with highly tuned implementations for multicore CPUs and GPUs written by experts

[1]  Kurt Keutzer,et al.  Copperhead: compiling an embedded data parallel language , 2011, PPoPP '11.

[2]  Roman Leshchinskiy,et al.  Stream fusion: from lists to streams to nothing at all , 2007, ICFP '07.

[3]  Abhishek Udupa,et al.  Software Pipelined Execution of Stream Programs on GPUs , 2009, 2009 International Symposium on Code Generation and Optimization.

[4]  Kunle Olukotun,et al.  Implementing Domain-Specific Languages for Heterogeneous Parallel Computing , 2011, IEEE Micro.

[5]  Markus Püschel,et al.  Bandit-based optimization on graphs with application to library performance tuning , 2009, ICML '09.

[6]  Sebastian Hack,et al.  Whole-function vectorization , 2011, International Symposium on Code Generation and Optimization (CGO 2011).

[7]  Bradley C. Kuszmaul,et al.  Cilk: an efficient multithreaded runtime system , 1995, PPOPP '95.

[8]  Saman P. Amarasinghe,et al.  Portable performance on heterogeneous architectures , 2013, ASPLOS '13.

[9]  James Reinders,et al.  Intel® threading building blocks , 2008 .

[10]  Murray Cole,et al.  Algorithmic Skeletons: Structured Management of Parallel Computation , 1989 .

[11]  Thomas B. Jablin,et al.  Triolet: a programming system that unifies algorithmic skeleton interfaces for high-performance cluster computing , 2014, PPoPP '14.

[12]  Kunle Olukotun,et al.  A domain-specific approach to heterogeneous parallelism , 2011, PPoPP '11.

[13]  Rudolf Eigenmann,et al.  OpenMP to GPGPU: a compiler framework for automatic translation and optimization , 2009, PPoPP '09.

[14]  Sean Lee,et al.  NOVA: A Functional Language for Data Parallelism , 2014, ARRAY@PLDI.

[15]  Simon Peyton Jones,et al.  Playing by the rules: rewriting as a practical optimisation technique in GHC , 2001 .

[16]  Frank Mueller,et al.  Hidp: A hierarchical data parallel language , 2013, Proceedings of the 2013 IEEE/ACM International Symposium on Code Generation and Optimization (CGO).

[17]  Tarek S. Abdelrahman,et al.  hiCUDA: High-Level GPGPU Programming , 2011, IEEE Transactions on Parallel and Distributed Systems.

[18]  Francky Catthoor,et al.  Polyhedral parallel code generation for CUDA , 2013, TACO.

[19]  David F. Bacon,et al.  Compiling a high-level language for GPUs: (via language support for architectures and compilers) , 2012, PLDI.

[20]  Frédo Durand,et al.  Halide: a language and compiler for optimizing parallelism, locality, and recomputation in image processing pipelines , 2013, PLDI 2013.

[21]  Scott A. Mahlke,et al.  Sponge: portable stream programming on graphics engines , 2011, ASPLOS XVI.

[22]  Sanjay Ghemawat,et al.  MapReduce: simplified data processing on large clusters , 2008, CACM.

[23]  Sergei Gorlatch,et al.  SkelCL - A Portable Skeleton Library for High-Level GPU Programming , 2011, 2011 IEEE International Symposium on Parallel and Distributed Processing Workshops and Phd Forum.

[24]  Alan Edelman,et al.  PetaBricks: a language and compiler for algorithmic choice , 2009, PLDI '09.

[25]  Francisco de Sande,et al.  accULL: An OpenACC Implementation with CUDA and OpenCL Support , 2012, Euro-Par.

[26]  William Thies,et al.  StreamIt: A Language for Streaming Applications , 2002, CC.

[27]  Albert Cohen,et al.  Hybrid Hexagonal/Classical Tiling for GPUs , 2014, CGO '14.

[28]  Franz Franchetti,et al.  SPIRAL: Code Generation for DSP Transforms , 2005, Proceedings of the IEEE.

[29]  Sanjay Ghemawat,et al.  MapReduce: Simplified Data Processing on Large Clusters , 2004, OSDI.

[30]  Sunita Chandrasekaran,et al.  Reduction Operations in Parallel Loops for GPGPUs , 2014, PMAM'14.

[31]  Scott A. Mahlke,et al.  Paraprox: pattern-based approximation for data parallel applications , 2014, ASPLOS.

[32]  Vijay Saraswat,et al.  GPU programming in a high level language: compiling X10 to CUDA , 2011, X10 '11.

[33]  Franz Franchetti,et al.  Operator Language: A Program Generation Framework for Fast Kernels , 2009, DSL.

[34]  Sriram Krishnamoorthy,et al.  On the Use of Term Rewriting for Performance Optimization of Legacy HPC Applications , 2012, 2012 41st International Conference on Parallel Processing.

[35]  Kunle Olukotun,et al.  A Heterogeneous Parallel Framework for Domain-Specific Languages , 2011, 2011 International Conference on Parallel Architectures and Compilation Techniques.

[36]  Weng-Fai Wong,et al.  Scalable framework for mapping streaming applications onto multi-GPU systems , 2012, PPoPP '12.

[37]  Markus Püschel,et al.  A Basic Linear Algebra Compiler , 2014, CGO '14.

[38]  Collin McCurdy,et al.  The Scalable Heterogeneous Computing (SHOC) benchmark suite , 2010, GPGPU-3.

[39]  Basilio B. Fraguela,et al.  An Algorithm Template for Domain-Based Parallel Irregular Algorithms , 2014, International Journal of Parallel Programming.

[40]  Manuel M. T. Chakravarty,et al.  Accelerating Haskell array codes with multicore GPUs , 2011, DAMP '11.

[41]  Trevor L. McDonell Optimising purely functional GPU programs , 2013, ICFP.