Analyzing Optimization Techniques for Power Efficiency on Heterogeneous Platforms

Graphics processing units (GPUs) have become widely accepted as the computing platform of choice in many high performance computing domains. Programming standards such as OpenCL are used to leverage the inherent parallelism offered by GPUs. Source code optimizations such as loop unrolling and tiling, when applied to heterogeneous applications, have been reported to yield large gains in performance. However, given the power consumption of GPUs, platforms can exhaust their power budgets quickly. Better solutions are needed to effectively exploit the power efficiency available on heterogeneous systems. In this work, we evaluate the power/performance efficiency of different optimizations used on heterogeneous applications. We analyze the power/performance trade-off by evaluating the energy consumption of the optimizations. We compare the performance of different optimization techniques on four different Fast Fourier Transform implementations. Our study covers discrete GPUs and shared memory GPUs (APUs), and includes hardware from AMD (Llano APUs and the Southern Islands GPU), Nvidia (Kepler) and Intel (Ivy Bridge) as test platforms. The study identifies the architectural and algorithmic factors that most impact power consumption. We explore a range of application optimizations which increase power consumption by 27%, but result in more than a 1.8X speedup in performance. We observe an 11% variation in energy consumption among different optimizations. We highlight how different optimizations can improve the execution performance of a heterogeneous application, but also impact the power efficiency of the application.
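The power/performance trade-off can be made concrete with a small calculation. The sketch below is illustrative only: it assumes that energy equals average power multiplied by execution time, and that the 27% power increase and the 1.8X speedup quoted above apply to the same optimized run, which the abstract does not state explicitly. The function name `relative_energy` is hypothetical and not taken from the paper.

```python
def relative_energy(power_ratio: float, speedup: float) -> float:
    """Energy of an optimized run relative to its baseline.

    Assumes E = P_avg * t, so E_opt / E_base = (P_opt / P_base) / speedup.
    """
    if speedup <= 0:
        raise ValueError("speedup must be positive")
    return power_ratio / speedup


# Assumed inputs: 27% higher average power, 1.8X faster execution.
ratio = relative_energy(power_ratio=1.27, speedup=1.8)
print(f"relative energy: {ratio:.3f}")  # prints "relative energy: 0.706"
```

Under these assumptions the optimized run would use roughly 71% of the baseline energy, which illustrates how an optimization can draw more power and still be more energy efficient.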
