A Linear Performance-Breakdown Model for GPU Programming Optimization Guidance

The use of Graphics Processing Units (GPUs) as computing accelerators has become widespread. Nevertheless, writing efficient GPU programs is a difficult and time-consuming task. In this paper we present the Linear Performance-Breakdown Model (LPBM), an analytic model that breaks down the execution time of GPU kernel programs into the three major components that determine their running time. The model can be used as a tool to detect performance bottlenecks and guide optimization. Our approach incorporates three elements: the Global-to-Shared Memory Time slice, the Shared-to-Private Time slice, and the Processing Units Time slice. These three factors are integrated into a performance-model formula by applying the Normalized Least Squares Method (NLSM). The resulting coefficients are used to construct a performance-breakdown graph that reveals the contribution of each element to the total execution time of the kernel program. We demonstrate the results obtained with the proposed model on two common numeric routines, Single-Precision General Matrix Multiplication (SGMM) and the Fast Fourier Transform (FFT), and apply the model to measurements from two GPU devices: an AMD A8-3870 Accelerated Processing Unit (APU) and an Nvidia GTX 660 GPU.
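To make the breakdown concrete, the sketch below illustrates how a linear model of this form could be fitted to measured kernel times. It is only an illustration under stated assumptions: it uses ordinary least squares via numpy in place of the paper's Normalized Least Squares Method, and the variable names and measurement values (t_gs, t_sp, t_pu, t_total) are hypothetical placeholders, not the authors' implementation or data.

```python
import numpy as np

# Hypothetical per-run estimates of the three time slices described in the
# abstract (Global-to-Shared, Shared-to-Private, Processing Units) and the
# measured total kernel execution time, in milliseconds. Placeholder values.
t_gs    = np.array([1.2, 2.3, 4.1, 8.0])   # Global-to-Shared Memory time slice
t_sp    = np.array([0.4, 0.9, 1.7, 3.2])   # Shared-to-Private time slice
t_pu    = np.array([0.8, 1.5, 3.0, 6.1])   # Processing Units time slice
t_total = np.array([2.6, 5.1, 9.4, 18.0])  # measured kernel execution time

# Design matrix: one column per model component.
A = np.column_stack([t_gs, t_sp, t_pu])

# Fit the coefficients c1, c2, c3 of the linear breakdown model
#   t_total ~= c1 * t_gs + c2 * t_sp + c3 * t_pu
# (ordinary least squares here, standing in for the paper's NLSM).
coeffs, *_ = np.linalg.lstsq(A, t_total, rcond=None)

# The weighted components give each element's share of the modeled time,
# which is the kind of information a performance-breakdown graph visualizes.
contributions = coeffs * A.mean(axis=0)
shares = contributions / contributions.sum()
for name, share in zip(
    ["Global-to-Shared", "Shared-to-Private", "Processing Units"], shares
):
    print(f"{name}: {100 * share:.1f}% of modeled execution time")
```

In this formulation the fitted coefficients weight each measured time slice, so comparing the weighted terms indicates which stage dominates a given kernel's execution time.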
