Understanding the GPU Microarchitecture to Achieve Bare-Metal Performance Tuning

In this paper, we present a methodology for understanding GPU microarchitectural features and improving the performance of compute-intensive kernels. The methodology relies on reverse engineering to crack the GPU ISA encodings and build a GPU assembler. An assembly microbenchmark suite then correlates microarchitectural features with their performance factors, uncovering instruction-level and memory-hierarchy preferences. Using SGEMM as a running example, we show how to achieve bare-metal performance tuning: FFMA throughput is raised by activating dual issue, eliminating register bank conflicts, interleaving non-FFMA instructions at little penalty, and choosing the proper width for global/shared load instructions. On an NVIDIA Kepler K20m, our SGEMM reaches 3.1 Tflop/s at 88% efficiency, 15% faster than cuBLAS 7.0. Applying the same optimizations to convolution yields a 39%-62% performance improvement over cuDNN 4.0. The toolchain is an attempt to automatically crack different GPU ISA encodings and adaptively build assemblers, for the purpose of enhancing application performance on GPUs.
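The quoted figures are self-consistent: a K20m has 13 SMXs with 192 CUDA cores each at 706 MHz, so its single-precision peak is 2496 × 2 × 0.706 GHz ≈ 3.52 Tflop/s, of which 3.1 Tflop/s is the stated 88%.

The paper's kernels are hand-written SASS produced with the authors' assembler; as a rough CUDA C++ illustration of just one of the tuning knobs, the load-width choice, consider the sketch below. Reading through a `float4` pointer asks the compiler to emit one 128-bit global or shared load in place of four 32-bit ones, shrinking the instruction stream that competes with FFMAs for issue slots. The kernel and the names `dot_wide` and `TILE` are ours, not the paper's, and the sketch assumes `n` is a multiple of `TILE` and a 64-thread block.

```cuda
#include <cuda_runtime.h>

// Hypothetical sketch, not the paper's kernel: float4 accesses yield 128-bit
// global and shared loads, so fewer load instructions compete with the
// FFMA-dominated arithmetic for issue slots.
#define TILE 256  // floats staged per block iteration; launch with 64 threads

__global__ void dot_wide(const float* __restrict__ a,
                         const float* __restrict__ b,
                         float* out, int n)  // n assumed a multiple of TILE
{
    __shared__ float sa[TILE];
    float acc = 0.0f;
    for (int base = 0; base < n; base += TILE) {
        // One 128-bit global load per thread stages 4 floats into shared memory.
        reinterpret_cast<float4*>(sa)[threadIdx.x] =
            reinterpret_cast<const float4*>(a + base)[threadIdx.x];
        __syncthreads();
        // One 128-bit shared load plus one 128-bit global load feed four FFMAs.
        float4 va = reinterpret_cast<const float4*>(sa)[threadIdx.x];
        float4 vb = reinterpret_cast<const float4*>(b + base)[threadIdx.x];
        acc += va.x * vb.x + va.y * vb.y + va.z * vb.z + va.w * vb.w;
        __syncthreads();
    }
    atomicAdd(out, acc);  // reduce per-thread partial dot products
}
```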
