A Taxonomy of GPGPU Performance Scaling

Graphics processing units (GPUs) range from small, embedded designs to large, high-powered discrete cards. While the performance of graphics workloads is generally understood, there has been little study of the performance of GPGPU applications across a variety of hardware configurations. This work presents performance scaling data gathered for 267 GPGPU kernels from 97 programs run on 891 hardware configurations of a modern GPU. We study the performance of these kernels across a 5× change in core frequency, 8.3× change in memory bandwidth, and 11× difference in compute units. We illustrate that many kernels scale in intuitive ways, such as those that scale directly with added computational capabilities or memory bandwidth. We also find a number of kernels that scale in non-obvious ways, such as losing performance when more processing units are added or plateauing as frequency and bandwidth are increased. In addition, we show that a number of current benchmark suites do not scale to modern GPU sizes, implying that either new benchmarks or new inputs are warranted.

[1]  Kevin Skadron,et al.  Pannotia: Understanding irregular GPGPU graph applications , 2013, 2013 IEEE International Symposium on Workload Characterization (IISWC).

[2]  Kevin Skadron,et al.  Rodinia: A benchmark suite for heterogeneous computing , 2009, 2009 IEEE International Symposium on Workload Characterization (IISWC).

[3]  Wen-mei W. Hwu,et al.  Parboil: A Revised Benchmark Suite for Scientific and Commercial Throughput Computing , 2012 .

[4]  Jing Zhang,et al.  OpenCL and the 13 dwarfs: a work in progress , 2012, ICPE '12.

[5]  Joseph L. Greathouse,et al.  Efficient Sparse Matrix-Vector Multiplication on GPUs Using the CSR Storage Format , 2014, SC14: International Conference for High Performance Computing, Networking, Storage and Analysis.

[6]  Collin McCurdy,et al.  The Scalable Heterogeneous Computing (SHOC) benchmark suite , 2010, GPGPU-3.

[7]  Mitesh R. Meswani,et al.  Efficient breadth-first search on a heterogeneous processor , 2014, 2014 IEEE International Conference on Big Data (Big Data).

[8]  Mayank Daga,et al.  Exploiting Coarse-Grained Parallelism in B+ Tree Searches on an APU , 2012, 2012 SC Companion: High Performance Computing, Networking Storage and Analysis.