Successive hardware generations have unquestionably improved the performance of GPU computing workloads over the last several years. From NVIDIA's GeForce 8800 GTX in November 2006 to its GeForce GTX 580 in November 2010, Moore's law and DRAM scaling have increased single-chip peak instruction throughput by 3× and off-chip bandwidth by 2.2×, respectively. However, raw capability numbers typically underestimate the improvement in real application performance over the same period, because architectural features also improved significantly. To demonstrate the effects of architectural features and optimizations over time, we ran a set of benchmarks from diverse application domains on multiple GPU architecture generations and measured how much performance has truly improved for those workloads. First, we demonstrate that certain architectural features make a large difference in the performance of unoptimized code; for example, the inclusion of a general cache can improve performance by 2-4× in some situations. Second, we describe which optimization patterns have been most essential and most widely applicable for improving the performance of GPU computing workloads across all architecture generations. Important patterns include data layout transformation, converting scatter accesses to gather accesses, GPU workload regularization, and granularity coarsening, each of which improved performance on at least one benchmark by over 20%, and sometimes by a factor of more than 5×. Although hardware improvements that speed up baseline, unoptimized code can reduce the magnitude of the speedup these patterns deliver, the patterns remain important even on the most recent GPUs. Finally, we identify which added architectural features created significant new optimization opportunities, such as increased register file capacity and reduced bandwidth penalties for misaligned accesses, which increase performance by 2× or more in the optimized versions of the relevant benchmarks.
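As a rough illustration of one pattern named above, converting scatter accesses to gather accesses, the following Python sketch contrasts the two formulations on a toy one-dimensional computation. The function names and the three-point neighborhood are assumptions made for illustration and are not taken from the benchmarks studied; the loops stand in for GPU threads and run sequentially here.

```python
def scatter_spread(inp):
    """Input-centric (scatter): each input element adds itself to
    its neighbouring output positions."""
    n = len(inp)
    out = [0] * n
    for i in range(n):                 # conceptually, one thread per input
        for j in (i - 1, i, i + 1):
            if 0 <= j < n:
                # Several inputs update the same out[j]; on a GPU this
                # would need atomics or another form of synchronization.
                out[j] += inp[i]
    return out


def gather_spread(inp):
    """Output-centric (gather): each output element reads the inputs
    that contribute to it and performs a single private write."""
    n = len(inp)
    out = [0] * n
    for j in range(n):                 # conceptually, one thread per output
        acc = 0
        for i in (j - 1, j, j + 1):
            if 0 <= i < n:
                acc += inp[i]          # reads only, no write conflicts
        out[j] = acc
    return out


if __name__ == "__main__":
    data = [1, 2, 3, 4]
    print(scatter_spread(data))
    print(gather_spread(data))
```

Both functions compute the same result; the gather form simply reassigns work so that each output location has a single writer, which is the property that makes this transformation attractive on parallel hardware.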