Performance and energy consumption analysis of Java code utilizing embedded GPU

GPUs and multicore CPUs are becoming common in today's embedded world of tablets and smartphones. As CPUs and GPUs grow more complex, maximizing hardware utilization while minimizing energy consumption becomes increasingly difficult. The challenges of GPGPU computing on embedded platforms differ from those on desktop systems because of tighter memory and computational limits. This study evaluates the advantages of offloading Java applications to an embedded GPU. We enable programmers to program an embedded GPU from Java using two approaches: the Java Native Interface with OpenCL (JNI-OpenCL) and Java bindings for OpenCL (JOCL). Experiments were conducted on a Freescale i.MX6Q SabreLite board, which contains a quad-core ARM Cortex-A9 CPU and a Vivante GC2000 GPU that supports the OpenCL 1.1 Embedded Profile. The results show up to an eightfold increase in performance while consuming only one-third of the energy of the CPU-only version of the Java program. This paper thus demonstrates the performance and energy benefits of offloading Java programs onto an embedded GPU. To the best of our knowledge, this is the first work involving Java acceleration on embedded GPUs.
