Compiling and Optimizing Java 8 Programs for GPU Execution

GPUs can enable significant performance improvements for certain classes of data-parallel applications and are widely used in modern computer systems. However, GPU execution currently requires explicit low-level operations such as 1) managing memory allocations and transfers between the host system and the GPU, 2) writing GPU kernels in a low-level programming model such as CUDA or OpenCL, and 3) optimizing the kernels by utilizing appropriate memory types on the GPU. Because of this complexity, in many cases only expert programmers can exploit the computational capabilities of GPUs through the CUDA/OpenCL languages. This is unfortunate, since a large number of programmers use high-level languages such as Java for their productivity, safety, and platform portability, but would still like to exploit the performance benefits of GPUs. Thus, one challenging problem is how to utilize GPUs while allowing programmers to continue to benefit from the productivity advantages of languages like Java. This paper presents a just-in-time (JIT) compiler that can generate and optimize GPU code from a pure Java program written using lambda expressions with the new parallel streams APIs in Java 8. These APIs allow Java programmers to express data parallelism at a higher level than threads and tasks. Our approach translates lambda expressions used with the Java 8 parallel streams APIs into GPU code and automatically generates runtime calls that handle the low-level operations mentioned above. Additionally, our optimization techniques 1) allocate and align the starting address of the Java array body in GPU memory with the memory transaction boundary to increase memory bandwidth, 2) utilize the read-only cache for array accesses to increase memory efficiency on GPUs, and 3) eliminate redundant data transfers between the host and the GPU. The compiler also performs loop versioning to eliminate redundant exception checks and to support virtual method invocations within GPU kernels. These features and optimizations are implemented and automatically applied by a JIT compiler built on top of a production version of the IBM Java 8 runtime environment. Our experimental results on an NVIDIA Tesla GPU show significant performance improvements over sequential execution (127.9x geometric mean) and parallel execution (3.3x geometric mean) for eight Java 8 benchmark programs running on a 160-thread POWER8 machine. This paper also includes an in-depth analysis of GPU execution that shows the impact of each of our optimization techniques by selectively disabling it. Our experimental results show a geometric-mean speedup of 1.15x in the GPU kernel over state-of-the-art approaches. Overall, our JIT compiler can improve the performance of Java 8 programs by automatically leveraging the computational capability of GPUs.
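For concreteness, the following minimal Java 8 fragment illustrates the kind of code this approach targets. The class name, array names, and problem size here are illustrative only; a GPU-offloading JIT compiler like the one described could compile the lambda passed to forEach() into a GPU kernel, since each iteration over the index range is independent.

    import java.util.stream.IntStream;

    public class VectorAdd {
        public static void main(String[] args) {
            final int n = 1 << 20;
            final float[] a = new float[n];
            final float[] b = new float[n];
            final float[] c = new float[n];
            for (int i = 0; i < n; i++) { a[i] = i; b[i] = 2.0f * i; }
            // A parallel stream over an index range: the lambda body can be
            // mapped to one GPU thread per index, with array transfers and
            // kernel launch handled by generated runtime calls.
            IntStream.range(0, n).parallel()
                     .forEach(i -> c[i] = a[i] + b[i]);
        }
    }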

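The loop-versioning idea mentioned above can also be sketched at the source level. The following is a hand-written Java illustration (class and method names are ours), not the compiler's actual internal representation: when null and bounds checks can be proven safe before the loop, the per-iteration checks become redundant and the loop is eligible for GPU execution; otherwise a fallback version preserving full Java exception semantics runs on the CPU.

    import java.util.stream.IntStream;

    public class LoopVersioningSketch {
        // Hypothetical sketch of loop versioning: two versions of the same
        // loop, guarded by an up-front safety test.
        static void scale(float[] a, float[] c, int n) {
            if (a != null && c != null && n <= a.length && n <= c.length) {
                // Fast version: per-iteration null/bounds checks are
                // provably redundant, so the loop can be offloaded.
                IntStream.range(0, n).parallel().forEach(i -> c[i] = a[i] * 2.0f);
            } else {
                // Safe version: ordinary JVM execution, with exceptions
                // thrown exactly as the Java language specification requires.
                IntStream.range(0, n).parallel().forEach(i -> c[i] = a[i] * 2.0f);
            }
        }
    }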