Automatic Matching of Legacy Code to Heterogeneous APIs: An Idiomatic Approach

Heterogeneous accelerators often disappoint. They promise great performance but deliver it only when programs use vendor-specific optimized libraries or domain-specific languages. Exploiting them therefore requires considerable modification of legacy code, hindering the adoption of heterogeneous computing. This paper develops a novel approach to automatically detect opportunities for accelerator exploitation. We focus on computations that are well supported by established APIs: sparse and dense linear algebra, stencil codes, generalized reductions, and histograms. We call these idioms and use a custom constraint-based Idiom Description Language (IDL) to discover them within user code. Detected idioms are then mapped to established libraries (BLAS, cuSPARSE, and clSPARSE) and two DSLs (Halide and Lift). We implemented the approach in LLVM and evaluated it on the NAS and Parboil sequential C/C++ benchmarks, where we detect 60 idiom instances. In cases where idioms account for a significant share of the sequential execution time, we generate code that achieves 1.26x to over 20x speedup on integrated and external GPUs.
