Cache-efficient numerical algorithms using graphics hardware

We present cache-efficient algorithms for scientific computations on graphics processing units (GPUs). Our approach maps the nested loops of numerical algorithms onto the texture mapping hardware and makes efficient use of the GPU caches. This mapping exploits the inherent parallelism, pipelining and high memory bandwidth of GPUs. We further improve performance by exploiting the fact that many data elements in a nested loop access memory at the same relative addresses. Based on this similarity, we decompose the input arrays into sub-arrays whose elements share a memory access pattern and process each sub-array separately. Our approach achieves high memory performance on GPUs by tiling the computation, which improves cache efficiency. Overall, our formulation of GPU-based algorithms extends the current graphics runtime APIs without exposing the underlying hardware complexity to the programmer, which makes it possible to achieve portability and higher performance across different GPUs. We use this approach to improve the performance of GPU-based sorting, fast Fourier transform and dense matrix multiplication algorithms. We also compare our results with prior GPU-based and CPU-based implementations on high-end processors. In practice, we observe a 2-10x improvement in performance.
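As a rough illustration of what tiling a nested-loop computation means, the sketch below blocks the three loops of a dense matrix multiplication so that each small tile of the operands is reused before the computation moves on. It is a CPU-side Python sketch only, not the GPU implementation described here: the function names `matmul_tiled` and `matmul_naive` and the `tile` parameter are hypothetical, and the tile size is arbitrary rather than tuned to any cache.

```python
def matmul_naive(a, b):
    """Reference product of a (n x m) and b (m x p) with untiled loops."""
    n, m, p = len(a), len(b), len(b[0])
    return [[sum(a[i][k] * b[k][j] for k in range(m)) for j in range(p)]
            for i in range(n)]


def matmul_tiled(a, b, tile=2):
    """Same product, but the iteration space is visited tile by tile.

    `tile` is an illustrative block edge length (an assumption of this
    sketch); in practice it would be chosen to fit the target cache.
    """
    n, m, p = len(a), len(b), len(b[0])
    c = [[0.0] * p for _ in range(n)]
    # Outer three loops step over tiles of the (i, j, k) iteration space.
    for i0 in range(0, n, tile):
        for j0 in range(0, p, tile):
            for k0 in range(0, m, tile):
                # Inner three loops stay inside one tile, so the same
                # small blocks of a, b and c are touched repeatedly.
                for i in range(i0, min(i0 + tile, n)):
                    for j in range(j0, min(j0 + tile, p)):
                        acc = c[i][j]
                        for k in range(k0, min(k0 + tile, m)):
                            acc += a[i][k] * b[k][j]
                        c[i][j] = acc
    return c
```

Both functions compute the same result; only the order in which the loop iterations are visited differs, which is the property that tiling relies on.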
