A performance prediction model for the CUDA GPGPU platform

The significant growth in computational power of modern Graphics Processing Units (GPUs) coupled with the advent of general purpose programming environments like NVIDIA's CUDA, has seen GPUs emerging as a very popular parallel computing platform. Till recently, there has not been a performance model for GPGPUs. The absence of such a model makes it difficult to definitively assess the suitability of the GPU for solving a particular problem and is a significant impediment to the mainstream adoption of GPUs as a massively parallel (super)computing platform. In this paper we present a performance prediction model for the CUDA GPGPU platform. This model encompasses the various facets of the GPU architecture like scheduling, memory hierarchy, and pipelining among others. We also perform experiments that demonstrate the effects of various memory access strategies. The proposed model can be used to analyze pseudo code for a CUDA kernel to obtain a performance estimate, in a way that is similar to performing asymptotic analysis. We illustrate the usage of our model and its accuracy with three case studies: matrix multiplication, list ranking, and histogram generation.

[1]  Hubert Nguyen,et al.  GPU Gems 3 , 2007 .

[2]  Thomas Ertl,et al.  Hardware Accelerated Wavelet Transformations , 2000, VisSym.

[3]  David A. Bader,et al.  On the Design and Analysis of Irregular Algorithms on the Cell Processor: A Case Study of List Ranking , 2007, 2007 IEEE International Parallel and Distributed Processing Symposium.

[4]  Yao Zhang,et al.  Scan primitives for GPU computing , 2007, GH '07.

[5]  Dinesh Manocha,et al.  Cache-efficient numerical algorithms using graphics hardware , 2007, Parallel Comput..

[6]  Yossi Matias,et al.  The Queue-Read Queue-Write PRAM Model: Accounting for Contention in Parallel Algorithms , 1999, SIAM J. Comput..

[7]  Angelos D. Keromytis,et al.  CryptoGraphics: Secret Key Cryptography Using Graphics Cards , 2005, CT-RSA.

[8]  Leslie G. Valiant,et al.  A bridging model for parallel computation , 1990, CACM.

[9]  N. England,et al.  Graphics Hardware , 2019, IEEE Computer Graphics and Applications.

[10]  P. J. Narayanan,et al.  Scalable Split and Gather Primitives for the GPU , 2009 .

[11]  Pat Hanrahan,et al.  Brook for GPUs: stream computing on graphics hardware , 2004, SIGGRAPH 2004.

[12]  Joseph JáJá,et al.  Designing Practical Efficient Algorithms for Symmetric Multiprocessors , 1999, ALENEX.

[13]  Hamid Laga,et al.  CUDA (Computer Unified Device Architecture) , 2009 .

[14]  Joseph JáJá,et al.  An Introduction to Parallel Algorithms , 1992 .

[15]  Vincent Rijmen,et al.  Rijndael/AES , 2005, Encyclopedia of Cryptography and Security.

[16]  Ivan Viola,et al.  Hardware-based nonlinear filtering and segmentation using high-level shading languages , 2003, IEEE Visualization, 2003. VIS 2003..

[17]  P. J. Narayanan,et al.  Fast and scalable list ranking on the GPU , 2009, ICS.

[18]  S. Sitharama Iyengar,et al.  Introduction to parallel algorithms , 1998, Wiley series on parallel and distributed computing.

[19]  Jason Yang,et al.  Symmetric Key Cryptography on Modern Graphics Hardware , 2007, ASIACRYPT.

[20]  Gary L. Miller,et al.  A Simple Randomized Parallel Algorithm for List-Ranking , 1990, Inf. Process. Lett..

[21]  David P. Anderson,et al.  SETI@home: an experiment in public-resource computing , 2002, CACM.

[22]  Richard Cole,et al.  Faster Optimal Parallel Prefix Sums and List Ranking , 2011, Inf. Comput..

[23]  Ramesh Subramonian,et al.  LogP: towards a realistic model of parallel computation , 1993, PPOPP '93.

[24]  Hyesoon Kim,et al.  An analytical model for a GPU architecture with memory-level and thread-level parallelism awareness , 2009, ISCA '09.

[25]  P. J. Narayanan,et al.  CUDA cuts: Fast graph cuts on the GPU , 2008, 2008 IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops.

[26]  Bruce Schneier,et al.  Description of a New Variable-Length Key, 64-bit Block Cipher (Blowfish) , 1993, FSE.

[27]  Ramani Duraiswami,et al.  Canny edge detection on NVIDIA CUDA , 2008, 2008 IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops.

[28]  Samuel Williams,et al.  The Landscape of Parallel Computing Research: A View from Berkeley , 2006 .

[29]  L. Dagum,et al.  OpenMP: an industry standard API for shared-memory programming , 1998 .

[30]  James Reinders,et al.  Intel threading building blocks - outfitting C++ for multi-core processor parallelism , 2007 .

[31]  Steven Fortune,et al.  Parallelism in random access machines , 1978, STOC.

[32]  David Brumley,et al.  Remote timing attacks are practical , 2003, Comput. Networks.

[33]  John Waldron,et al.  Practical Symmetric Key Cryptography on Modern Graphics Hardware , 2008, USENIX Security Symposium.

[34]  Paul C. Kocher,et al.  Timing Attacks on Implementations of Diffie-Hellman, RSA, DSS, and Other Systems , 1996, CRYPTO.

[35]  Yossi Matias,et al.  The Queue-Read Queue-Write Asynchronous PRAM Model , 1996, Theor. Comput. Sci..

[36]  Peter Schwabe,et al.  Faster and Timing-Attack Resistant AES-GCM , 2009, CHES.

[37]  Wen-mei W. Hwu,et al.  Program optimization space pruning for a multithreaded gpu , 2008, CGO '08.

[38]  Vijay S. Pande,et al.  Folding@home: Lessons from eight years of volunteer distributed computing , 2009, 2009 IEEE International Symposium on Parallel & Distributed Processing.

[39]  P. J. Narayanan,et al.  Accelerating Large Graph Algorithms on the GPU Using CUDA , 2007, HiPC.

[40]  Kevin Skadron,et al.  Performance modeling and automatic ghost zone optimization for iterative stencil loops on GPUs , 2009, ICS.

[41]  Emilio L. Zapata,et al.  Memory Locality Exploitation Strategies for FFT on the CUDA Architecture , 2008, VECPAR.

[42]  David R. Kaeli,et al.  Exploring the multiple-GPU design space , 2009, 2009 IEEE International Symposium on Parallel & Distributed Processing.

[43]  Naga K. Govindaraju,et al.  A Survey of General‐Purpose Computation on Graphics Hardware , 2007 .

[44]  Michael T. Goodrich,et al.  A bridging model for parallel computation, communication, and I/O , 1996, CSUR.

[45]  Kevin Skadron,et al.  Scalable parallel programming , 2008, 2008 IEEE Hot Chips 20 Symposium (HCS).