A Memory Model for Scientific Algorithms on Graphics Processors

We present a memory model to analyze and improve the performance of scientific algorithms on graphics processing units (GPUs). Our memory model is based on texturing hardware, which uses a 2D block-based array representation to perform the underlying computations. We incorporate many characteristics of GPU architectures including smaller cache sizes, 2D block representations, and use the 3C's model to analyze the cache misses. Moreover, we present techniques to improve the performance of nested loops on GPUs. In order to demonstrate the effectiveness of our model, we highlight its performance on three memory-intensive scientific applications - sorting, fast Fourier transform and dense matrix-multiplication. In practice, our cache-efficient algorithms for these applications are able to achieve memory throughput of 30-50 GB/s on a NVIDIA 7900 GTX GPU. We also compare our results with prior GPU-based and CPU-based implementations on high-end processors. In practice, we are able to achieve 2-5x performance improvement

[1]  Kenneth E. Batcher,et al.  Sorting networks and their applications , 1968, AFIPS Spring Joint Computing Conference.

[2]  Michael Wolfe,et al.  Iteration Space Tiling for Memory Hierarchies , 1987, PPSC.

[3]  Alok Aggarwal,et al.  The input/output complexity of sorting and related problems , 1988, CACM.

[4]  Alan Jay Smith,et al.  Evaluating Associativity in CPU Caches , 1989, IEEE Trans. Computers.

[5]  R. Tolimieri,et al.  Algorithms for Discrete Fourier Transform and Convolution , 1989 .

[6]  Monica S. Lam,et al.  The cache performance and optimizations of blocked algorithms , 1991, ASPLOS IV.

[7]  Ken Kennedy,et al.  Compiler blockability of numerical algorithms , 1992, Proceedings Supercomputing '92.

[8]  Keshav Pingali,et al.  Access normalization: loop restructuring for NUMA computers , 1993, TOCS.

[9]  David F. Bacon,et al.  Compiler transformations for high-performance computing , 1994, CSUR.

[10]  Kathryn S. McKinley,et al.  Tile size selection using cache organization and data layout , 1995, PLDI '95.

[11]  Michael Wolfe,et al.  High performance compilers for parallel computing , 1995 .

[12]  Keshav Pingali,et al.  Data-centric multi-level blocking , 1997, PLDI '97.

[13]  PingaliKeshav,et al.  Data-centric multi-level blocking , 1997 .

[14]  Anoop Gupta,et al.  The Design and Analysis of a Cache Architecture for Texture Mapping , 1997, ISCA.

[15]  Matteo Frigo,et al.  Cache-oblivious algorithms , 1999, 40th Annual Symposium on Foundations of Computer Science (Cat. No.99CB37039).

[16]  Sandeep Sen,et al.  Towards a theory of cache-efficient algorithms , 2000, SODA '00.

[17]  Jeffrey Scott Vitter,et al.  External memory algorithms and data structures: dealing with massive data , 2001, CSUR.

[18]  David K. McAllister,et al.  Fast Matrix Multiplies Using Graphics Hardware , 2001, ACM/IEEE SC 2001 Conference (SC'01).

[19]  Martin Rumpf,et al.  Using Graphics Cards for Quantized FEM Computations , 2001, VIIP.

[20]  Anselmo Lastra,et al.  Simulation of cloud dynamics on graphics hardware , 2003, HWWS '03.

[21]  Pat Hanrahan,et al.  Photon mapping on programmable graphics hardware , 2003, HWWS '03.

[22]  Eitan Grinspun,et al.  Sparse matrix solvers on the GPU: conjugate gradients and multigrid , 2003, ACM Trans. Graph..

[23]  Ming C. Lin,et al.  Visual simulation of ice crystal growth , 2003, SCA '03.

[24]  GrinspunEitan,et al.  Sparse matrix solvers on the GPU , 2003 .

[25]  Michael D. McCool,et al.  Shader algebra , 2004, ACM Trans. Graph..

[26]  Pat Hanrahan,et al.  Understanding the efficiency of GPU algorithms for matrix-matrix multiplication , 2004, Graphics Hardware.

[27]  Rüdiger Westermann,et al.  UberFlow: a GPU-based particle engine , 2004, SIGGRAPH '04.

[28]  Lars Arge,et al.  Cache-Oblivious Data Structures , 2004, Handbook of Data Structures and Applications.

[29]  Pat Hanrahan,et al.  Brook for GPUs: stream computing on graphics hardware , 2004, ACM Trans. Graph..

[30]  Arie E. Kaufman,et al.  GPU Cluster for High Performance Computing , 2004, Proceedings of the ACM/IEEE SC2004 Conference.

[31]  Dinesh Manocha,et al.  Fast computation of database operations using graphics processors , 2005, SIGGRAPH Courses.

[32]  Jens H. Krüger,et al.  A Survey of General‐Purpose Computation on Graphics Hardware , 2007, Eurographics.

[33]  Rüdiger Westermann,et al.  Linear algebra operators for GPU implementation of numerical algorithms , 2003, SIGGRAPH Courses.

[34]  Dinesh Manocha,et al.  LU-GPU: Efficient Algorithms for Solving Dense Linear Systems on Graphics Hardware , 2005, ACM/IEEE SC 2005 Conference (SC'05).

[35]  Dinesh Manocha,et al.  Fast and approximate stream mining of quantiles and frequencies using graphics processors , 2005, SIGMOD '05.

[36]  Dinesh Manocha,et al.  GPUTeraSort: high performance graphics co-processor sorting for large database management , 2006, SIGMOD Conference.