Improving Throughput of Power-Constrained GPUs Using Dynamic Voltage/Frequency and Core Scaling

State-of-the-art graphics processing units (GPUs) can offer very high computational throughput for highly parallel applications using hundreds of integrated cores. In general, the peak throughput of a GPU is proportional to the product of the number of cores and their frequency, but this product is often limited by a power constraint. Although throughput can be increased with more cores for some applications, it cannot for others, because the parallelism of the application and/or the bandwidth of the on-chip interconnects/caches and off-chip memory is limited. In this paper, we first demonstrate that adjusting the number of operating cores and the voltage/frequency of the cores and/or on-chip interconnects/caches on a per-application basis can improve GPU throughput under a power constraint. Second, we show that dynamically scaling the number of operating cores and the voltages/frequencies of both the cores and the on-chip interconnects/caches at runtime can improve application throughput even further. Our experimental results show that a GPU adopting our runtime dynamic voltage/frequency and core scaling technique can provide up to 38% (and nearly 20% on average) higher throughput than the baseline GPU under the same power constraint.
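To make the trade-off concrete, the sketch below (a minimal illustration, not the paper's actual runtime technique) brute-forces the number of active cores and the discrete voltage/frequency levels of the cores and the interconnect/caches to maximize an estimated throughput under a fixed power budget. The CV²f-style power model, the roofline-style throughput cap, and all constants (operating points, core count, power budget, effective capacitances, memory-bound fraction) are illustrative assumptions, not values from the paper.

```python
# Minimal sketch: exhaustive search over core count and V/f levels of the cores
# and the on-chip interconnect/caches, maximizing estimated throughput subject
# to a power budget. All models and constants below are illustrative assumptions.

from itertools import product

# Assumed discrete V/f operating points: (voltage in V, frequency in GHz).
VF_LEVELS = [(0.80, 0.80), (0.90, 1.00), (1.00, 1.20), (1.10, 1.40)]

MAX_CORES = 32          # assumed number of physical cores
POWER_BUDGET_W = 100.0  # assumed chip-level power constraint

def core_power(v, f):
    """Dynamic + static power of one core (illustrative C*V^2*f-style model)."""
    c_eff = 2.0e-9          # assumed effective switched capacitance (F)
    static = 0.5 * v        # assumed leakage roughly proportional to voltage
    return c_eff * v * v * (f * 1e9) + static

def interconnect_power(v, f):
    """Power of the on-chip interconnect/caches at a given V/f point."""
    c_eff = 8.0e-9
    static = 2.0 * v
    return c_eff * v * v * (f * 1e9) + static

def throughput(n_cores, f_core, f_ic, mem_bound_fraction=0.3):
    """Peak throughput ~ cores * core frequency, capped by interconnect/memory
    bandwidth for the memory-bound fraction of the workload (simple
    roofline-style approximation)."""
    compute = n_cores * f_core
    bandwidth_cap = 12.0 * f_ic          # assumed scaling of interconnect BW
    return min(compute, bandwidth_cap / mem_bound_fraction)

best = None
for n, (vc, fc), (vi, fi) in product(range(1, MAX_CORES + 1),
                                     VF_LEVELS, VF_LEVELS):
    total_power = n * core_power(vc, fc) + interconnect_power(vi, fi)
    if total_power > POWER_BUDGET_W:
        continue  # configuration violates the power constraint
    perf = throughput(n, fc, fi)
    if best is None or perf > best[0]:
        best = (perf, n, fc, fi, total_power)

perf, n, fc, fi, p = best
print(f"best config: {n} cores @ {fc} GHz, interconnect @ {fi} GHz "
      f"-> throughput {perf:.1f}, power {p:.1f} W")
```

With these assumed models, the best configuration is not the one with the most cores at the highest frequency: the search settles on fewer active cores and a lower core V/f level so that the remaining budget can feed the interconnect, mirroring the per-application balancing the paper argues for.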
