Analytical Performance Prediction for Evaluation and Tuning of GPGPU Applications

In this paper we present an analytical model to predict the performance of general purpose applications on a GPU architecture. Themodel is designed to provide performance information to an auto-tuning compiler and assist it narrow the search to the more promising implementations. This work is based on the NVIDIAGPUs using CUDA (ComputeUnified Device Architecture). We analyze each CUDA kernel and generate the corresponding string model which is a concise representation of the operations of a kernel. String model for a kernel summarizes how the kernel exercises major GPU microarchitecture features. Based on the string model we estimate the average execution time of a warp, which is the SIMD work granularity for CUDA. We validated the performance model using a few data parallel benchmarks that exploit different microarchitecture features of GPU architecture. The model captures full system complexity and shows high accuracy in predicting the performance trend of different optimized implementations. We also describe our approach to extract the performance model automatically.

[1]  Marc Snir,et al.  Automatic tuning matrix multiplication performance on graphics hardware , 2005, 14th International Conference on Parallel Architectures and Compilation Techniques (PACT'05).

[2]  Michael Wolfe,et al.  Beyond induction variables: detecting and classifying sequences using a demand-driven SSA form , 1995, TOPL.

[3]  Rudolf Eigenmann,et al.  Fast and effective orchestration of compiler optimizations for automatic performance tuning , 2006, International Symposium on Code Generation and Optimization (CGO'06).

[4]  Ken Kennedy,et al.  Automatic tuning of whole applications using direct search and a performance-based transformation system , 2006, The Journal of Supercomputing.

[5]  Kevin Skadron,et al.  Scalable parallel programming , 2008, 2008 IEEE Hot Chips 20 Symposium (HCS).

[6]  John M. Mellor-Crummey,et al.  Cross-architecture performance predictions for scientific applications using parameterized models , 2004, SIGMETRICS '04/Performance '04.

[7]  Naga K. Govindaraju,et al.  High performance discrete Fourier transforms on graphics processors , 2008, HiPC 2008.

[8]  Pat Hanrahan,et al.  Understanding the efficiency of GPU algorithms for matrix-matrix multiplication , 2004, Graphics Hardware.

[9]  Guy E. Blelloch,et al.  Prefix sums and their applications , 1990 .

[10]  Joe D. Warren,et al.  The program dependence graph and its use in optimization , 1984, TOPL.

[11]  N.K. Govindaraju,et al.  A Memory Model for Scientific Algorithms on Graphics Processors , 2006, ACM/IEEE SC 2006 Conference (SC'06).

[12]  James E. Smith,et al.  A first-order superscalar processor model , 2004, Proceedings. 31st Annual International Symposium on Computer Architecture, 2004..

[13]  Jens H. Krüger,et al.  A Survey of General‐Purpose Computation on Graphics Hardware , 2007, Eurographics.

[14]  Michael Wolfe,et al.  Beyond induction variables , 1992, PLDI '92.

[15]  David I. August,et al.  Compiler optimization-space exploration , 2003, International Symposium on Code Generation and Optimization, 2003. CGO 2003..

[16]  Wen-mei W. Hwu,et al.  Program optimization space pruning for a multithreaded gpu , 2008, CGO '08.

[17]  Mark J. Clement,et al.  Analytical performance prediction on multicomputers , 1993, Supercomputing '93. Proceedings.

[18]  Weiguo Liu,et al.  Performance Predictions for General-Purpose Computation on GPUs , 2007, 2007 International Conference on Parallel Processing (ICPP 2007).

[19]  Mark N. Wegman,et al.  Efficiently computing static single assignment form and the control dependence graph , 1991, TOPL.

[20]  Chen Ding,et al.  Miss rate prediction across all program inputs , 2003, 2003 12th International Conference on Parallel Architectures and Compilation Techniques.

[21]  Rudolf Eigenmann,et al.  Fast, automatic, procedure-level performance tuning , 2006, 2006 International Conference on Parallel Architectures and Compilation Techniques (PACT).

[22]  R. C. Whaley,et al.  Timing high performance kernels through empirical compilation , 2005, 2005 International Conference on Parallel Processing (ICPP'05).

[23]  FerranteJeanne,et al.  The program dependence graph and its use in optimization , 1987 .