Analyzing power efficiency of optimization techniques and algorithm design methods for applications on heterogeneous platforms

Graphics processing units (GPUs) have become widely accepted as the computing platform of choice in many high-performance computing domains. Programming standards such as OpenCL make it possible to leverage the inherent parallelism offered by GPUs. Source code optimizations such as loop unrolling and tiling, when applied to heterogeneous applications, have been reported to yield large gains in performance. However, given the power consumption of GPUs, platforms can exhaust their power budgets quickly. Better solutions are needed to effectively exploit the power efficiency available on heterogeneous systems. In this work, we evaluate the power/performance efficiency of different optimizations used on heterogeneous applications. We analyze the power/performance trade-off by evaluating the energy consumption of the optimizations. We compare the performance of different optimization techniques on four different fast Fourier transform (FFT) implementations. Our study covers discrete GPUs, shared-memory GPUs (APUs), and low-power system-on-chip (SoC) devices, and includes hardware from AMD (Llano APUs and the Southern Islands GPU), Nvidia (Kepler), Intel (Ivy Bridge), and Qualcomm (Snapdragon S4) as test platforms. The study identifies the architectural and algorithmic factors that most affect power consumption. We explore a range of application optimizations that increase power consumption by 27% but deliver a speedup of more than 1.8×. We observe up to an 18% reduction in power consumption due to reduced kernel calls across FFT implementations. We also observe an 11% variation in energy consumption among different optimizations. We highlight how different optimizations can improve the execution performance of a heterogeneous application, but also impact the power efficiency of the application. More importantly, we demonstrate that different algorithms implementing the same fundamental function (FFT) can differ vastly in performance depending on the target hardware and associated application design.
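
The relationship between power, speed, and energy behind this trade-off can be illustrated with a short calculation. The sketch below is not taken from the study; it assumes that energy equals average power multiplied by execution time, and it pairs the 27% power increase with the 1.8× speedup purely for illustration, since the abstract does not state that both figures come from the same run. The function name `relative_energy` is hypothetical.

```python
def relative_energy(power_ratio: float, speedup: float) -> float:
    """Energy of an optimized run relative to its baseline.

    Assumes energy = average power x execution time, so
    E_opt / E_base = (P_opt / P_base) * (t_opt / t_base)
                   = power_ratio / speedup.
    """
    if power_ratio <= 0 or speedup <= 0:
        raise ValueError("power_ratio and speedup must be positive")
    return power_ratio / speedup


# Illustrative inputs only: pairing these two figures in a single run is an
# assumption made for this sketch, not a result reported by the study.
ratio = relative_energy(power_ratio=1.27, speedup=1.8)
print(f"relative energy: {ratio:.3f}")
```

Under these assumptions, a ratio below 1 means the optimized run uses less total energy than the baseline even though it draws more power, because it finishes sooner.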
