Applying source-level auto-vectorization to Aparapi Java

Parallelism dominates modern hardware design, from multi-core CPUs to SIMD units and GPGPUs. This brings with it, however, the need to program such hardware in a programmer-friendly manner. Traditionally, managed languages like Java have struggled to exploit data-parallel hardware, but projects like Aparapi provide a programming model that lets programmers easily express the parallelism in their code while still working in a high-level language. This work takes advantage of that programmer-specified parallelism to perform source-level auto-vectorization, an optimization rarely considered in Java compilation. It does so with a source-to-source auto-vectorization transformation over the Aparapi Java program, together with a JNI vector library pre-compiled to exploit the available SIMD instructions. The result replaces the existing Aparapi fallback path, which is used when no OpenCL device exists or when the device has insufficient memory for the program. We show that for all ten benchmarks tested, the auto-vectorization tool produced an implementation that beat the default Aparapi fallback path by factors of 4.56x and 3.24x on average on a desktop and a server system, respectively. Moreover, this improved fallback path even outperformed the GPU implementation for six of the ten benchmarks.
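For readers unfamiliar with the model, a minimal Aparapi kernel is sketched below (this example is ours, not the paper's; the class and array names are hypothetical, and the package is com.aparapi in current releases, com.amd.aparapi in older ones). Each work-item handles one array element, so the data-parallel structure is explicit in the source; when no usable OpenCL device is present, Aparapi runs the same run() method in a Java thread pool, which is the fallback path this work replaces with a vectorized one.

    import com.aparapi.Kernel;
    import com.aparapi.Range;

    public class VectorAdd {
        public static void main(String[] args) {
            final int n = 1 << 20;
            final float[] a = new float[n];
            final float[] b = new float[n];
            final float[] sum = new float[n];
            for (int i = 0; i < n; i++) {
                a[i] = i;
                b[i] = 2.0f * i;
            }

            // One work-item per element: the loop over i is implicit in the
            // Range, and this explicit parallelism is what a source-level
            // vectorizer can map directly onto SIMD lanes.
            Kernel kernel = new Kernel() {
                @Override
                public void run() {
                    int gid = getGlobalId();
                    sum[gid] = a[gid] + b[gid];
                }
            };
            kernel.execute(Range.create(n));
            kernel.dispose();
        }
    }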
