Improving Performance of Matrix Multiplication and FFT on GPU

In this paper we discuss our experiences in improving the performance of two key algorithms using CUDA: the single-precision matrix-matrix multiplication subprogram (SGEMM of BLAS) and the single-precision FFT. The former is computation-intensive, while the latter is memory-bandwidth- or communication-intensive. For SGEMM, a peak performance of 393 Gflops is achieved on an NVIDIA GeForce GTX280, about 5% faster than the CUBLAS 2.0 library. For the FFT, better performance is obtained over a range of dimensions. Some common principles for the design and implementation of many-core algorithms are also discussed.
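As a point of reference for the kind of kernel being tuned, the sketch below shows a minimal shared-memory tiled SGEMM kernel in CUDA. It only illustrates the basic tiling and data-reuse principle, not the authors' optimized 393-Gflops implementation; the tile size, row-major layout, launch configuration, and the assumption that all matrix dimensions are multiples of the tile size are illustrative choices.

// Minimal tiled SGEMM sketch: C = alpha*A*B + beta*C, row-major layout assumed
// (BLAS SGEMM itself is column-major). Each thread block computes one
// TILE x TILE tile of C, staging operand tiles in shared memory so that
// each global load is reused TILE times.
#include <cuda_runtime.h>

#define TILE 16

__global__ void sgemm_tiled(int M, int N, int K,
                            float alpha, const float *A, const float *B,
                            float beta, float *C)
{
    __shared__ float As[TILE][TILE];
    __shared__ float Bs[TILE][TILE];

    int row = blockIdx.y * TILE + threadIdx.y;   // row index in C
    int col = blockIdx.x * TILE + threadIdx.x;   // column index in C
    float acc = 0.0f;

    // Walk over the K dimension one tile at a time.
    for (int t = 0; t < K; t += TILE) {
        As[threadIdx.y][threadIdx.x] = A[row * K + (t + threadIdx.x)];
        Bs[threadIdx.y][threadIdx.x] = B[(t + threadIdx.y) * N + col];
        __syncthreads();

        for (int k = 0; k < TILE; ++k)
            acc += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        __syncthreads();
    }

    C[row * N + col] = alpha * acc + beta * C[row * N + col];
}

// Example launch, assuming M, N, K are multiples of TILE:
//   dim3 block(TILE, TILE);
//   dim3 grid(N / TILE, M / TILE);
//   sgemm_tiled<<<grid, block>>>(M, N, K, alpha, dA, dB, beta, dC);

A kernel of this shape is compute-bound once the tiles fit in shared memory, which is why SGEMM responds mainly to register/shared-memory blocking, whereas the FFT discussed in the paper is limited by memory bandwidth and data movement between threads.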
