Improving Throughput of Power-Constrained GPUs Using Dynamic Voltage/Frequency and Core Scaling

State-of-the-art graphics processing units (GPUs) can offer very high computational throughput for highly parallel applications using hundreds of integrated cores. In general, the peak throughput of a GPU is proportional to the product of the number of cores and their frequency, but this product is often limited by a power constraint. Although throughput can be increased with more cores for some applications, it cannot for others, because the parallelism of the application and/or the bandwidth of the on-chip interconnects/caches and off-chip memory is limited. In this paper, we first demonstrate that adjusting the number of operating cores and the voltage/frequency of the cores and/or on-chip interconnects/caches on a per-application basis can improve GPU throughput under a power constraint. Second, we show that dynamically scaling the number of operating cores and the voltages/frequencies of both the cores and the on-chip interconnects/caches at runtime can improve application throughput even further. Our experimental results show that a GPU adopting our runtime dynamic voltage/frequency and core scaling technique can provide up to 38% (and nearly 20% on average) higher throughput than the baseline GPU under the same power constraint.
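To make the trade-off concrete, the sketch below (a minimal illustration, not the paper's actual runtime technique) brute-forces the number of active cores and the discrete voltage/frequency levels of the cores and the interconnect/caches to maximize an estimated throughput under a fixed power budget. The CV²f-style power model, the roofline-style throughput cap, and all constants (operating points, core count, power budget, effective capacitances, memory-bound fraction) are illustrative assumptions, not values from the paper.

```python
# Minimal sketch: exhaustive search over core count and V/f levels of the cores
# and the on-chip interconnect/caches, maximizing estimated throughput subject
# to a power budget. All models and constants below are illustrative assumptions.

from itertools import product

# Assumed discrete V/f operating points: (voltage in V, frequency in GHz).
VF_LEVELS = [(0.80, 0.80), (0.90, 1.00), (1.00, 1.20), (1.10, 1.40)]

MAX_CORES = 32          # assumed number of physical cores
POWER_BUDGET_W = 100.0  # assumed chip-level power constraint

def core_power(v, f):
    """Dynamic + static power of one core (illustrative C*V^2*f-style model)."""
    c_eff = 2.0e-9          # assumed effective switched capacitance (F)
    static = 0.5 * v        # assumed leakage roughly proportional to voltage
    return c_eff * v * v * (f * 1e9) + static

def interconnect_power(v, f):
    """Power of the on-chip interconnect/caches at a given V/f point."""
    c_eff = 8.0e-9
    static = 2.0 * v
    return c_eff * v * v * (f * 1e9) + static

def throughput(n_cores, f_core, f_ic, mem_bound_fraction=0.3):
    """Peak throughput ~ cores * core frequency, capped by interconnect/memory
    bandwidth for the memory-bound fraction of the workload (simple
    roofline-style approximation)."""
    compute = n_cores * f_core
    bandwidth_cap = 12.0 * f_ic          # assumed scaling of interconnect BW
    return min(compute, bandwidth_cap / mem_bound_fraction)

best = None
for n, (vc, fc), (vi, fi) in product(range(1, MAX_CORES + 1),
                                     VF_LEVELS, VF_LEVELS):
    total_power = n * core_power(vc, fc) + interconnect_power(vi, fi)
    if total_power > POWER_BUDGET_W:
        continue  # configuration violates the power constraint
    perf = throughput(n, fc, fi)
    if best is None or perf > best[0]:
        best = (perf, n, fc, fi, total_power)

perf, n, fc, fi, p = best
print(f"best config: {n} cores @ {fc} GHz, interconnect @ {fi} GHz "
      f"-> throughput {perf:.1f}, power {p:.1f} W")
```

With these assumed models, the best configuration is not the one with the most cores at the highest frequency: the search settles on fewer active cores and a lower core V/f level so that the remaining budget can feed the interconnect, mirroring the per-application balancing the paper argues for.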
