LambdaJIT: a dynamic compiler for heterogeneous optimizations of STL algorithms

C++11 introduced a set of new features that extend both the core language and the standard library. Among them are building blocks for concurrency management, such as threads and atomic operations, and a new syntax for declaring single-purpose, one-off functions, called lambda functions, which integrate nicely with the Standard Template Library (STL). The STL provides a set of high-level algorithms operating on data ranges, often applying a user-defined function, which can now be expressed as a lambda function. Together, an STL algorithm and a lambda function provide a concise and efficient way to express a data-traversal pattern and its localized computation. This paper presents LambdaJIT, a C++11 compiler and runtime system that enable lambda functions used alongside STL algorithms to be optimized or even re-targeted at runtime. We use compiler integration of the new C++ features to analyze the code and automatically parallelize it whenever possible. The compiler also injects part of the program's internal representation into the compiled binary, which the runtime can use to re-compile and optimize the code. We take advantage of the properties of lambda functions to create runtime optimizations exceeding those of traditional offline or online compilers. Finally, the runtime can use the embedded intermediate representation with a different backend to safely offload computation to an accelerator such as a GPU, matching and even outperforming CUDA by up to 10%.
