Power-efficient computing for compute-intensive GPGPU applications

The peak compute performance of GPUs has been increased by integrating more compute resources and operating them at higher frequencies. However, such approaches significantly increase the power consumption of GPUs, and the power constraint in turn limits further performance gains. To address this challenge, this paper proposes three techniques to improve the power efficiency and performance of GPUs. First, we observe that many GPGPU applications are integer-intensive. For such applications, we combine a pair of dependent integer instructions into a composite instruction that can be executed by an enhanced fused multiply-add unit. Second, we observe that the computations of many instructions are duplicated across multiple threads. We dynamically detect such instructions and execute them in a separate scalar unit. Finally, we observe that 16 or fewer bits are sufficient to accurately represent the operands and results of many instructions. Thus, we split the 32-bit datapath into two 16-bit datapath slices that can concurrently issue and execute up to two such instructions per cycle. All three proposed techniques can considerably increase the utilization of compute resources, improving power efficiency and performance by 20% and 15%, respectively.
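The three observations in the abstract can be illustrated with a small software sketch. This is not the paper's hardware mechanism: the function names, the warp width of 32, the signed two's-complement width check, and the choice of a multiply followed by an add as the dependent integer pair are all assumptions made here for illustration only.

```python
from typing import Sequence, Tuple

WARP_SIZE = 32  # assumed number of threads per warp; not stated in the abstract


def fits_in_16_bits(value: int) -> bool:
    """True if a signed integer is representable in 16 bits (two's complement)."""
    return -(1 << 15) <= value < (1 << 15)


def is_narrow(operands: Sequence[int], result: int) -> bool:
    """Assumed criterion: every operand and the result fit in 16 bits,
    so the instruction could run on one 16-bit datapath slice."""
    return all(fits_in_16_bits(v) for v in operands) and fits_in_16_bits(result)


def is_scalar(per_thread_operands: Sequence[Tuple[int, ...]]) -> bool:
    """True if all threads supply identical operands, so a single
    computation on a scalar unit would produce every thread's result."""
    return len(set(per_thread_operands)) <= 1


def fused_mul_add(a: int, b: int, c: int) -> int:
    """Composite of an assumed dependent pair (t = a * b; r = t + c),
    with the result wrapped to 32 bits."""
    return (a * b + c) & 0xFFFFFFFF


if __name__ == "__main__":
    uniform = [(4, 8)] * WARP_SIZE                    # same operands in every thread
    divergent = [(tid, 8) for tid in range(WARP_SIZE)]  # operand depends on thread id
    print(is_scalar(uniform), is_scalar(divergent))
    print(is_narrow((100, 200), 300), is_narrow((30000, 30000), 60000))
    print(fused_mul_add(3, 4, 5))
```

In this sketch each check is a pure function of the operand values, which mirrors the idea that detection happens dynamically at run time rather than at compile time.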
